Data disambiguation systems and methods

ABSTRACT

Various embodiments provide a state-based, regular expression parser in which data, such as generally unstructured text, is received into the system and undergoes a tokenization process which permits structure to be imparted to the data. Tokenization of the data effectively enables various patterns in the data to be identified. In some embodiments, one or more components can utilize stimulus/response paradigms to recognize and react to patterns in the data.

The present application is a continuation of copending U.S. patent application Ser. No. 10/839,425, filed May 4, 2004.

TECHNICAL FIELD

This invention relates generally to data disambiguation. More particularly, the invention pertains to systems, methods and software architectures that are directed to pattern processing and recognition in the context of generally unstructured data.

BACKGROUND

There is a great deal of so-called unstructured data that resides in the world. Typically, unstructured data is, as the name implies, highly unstructured and difficult to work with. Perhaps the best way to understand unstructured data is from the perspective of structured data. Structured data, by its very nature, is typically easily indexed and searched.

As an example, consider the following. In many cases, governments, corporations, and various other large entities such as businesses and the like, can have many thousands of documents to deal with. These documents constitute knowledge in the sense that the documents contain information that might be useful to the particular entity. Yet, by virtue of the voluminous number of documents and the fact that such documents may be in a generally unstructured state, this knowledge is not reasonably and readily attained by these entities. Even if such entities were to have, for example, an intranet, one would have to know what to specifically search for, and what the information means to the searcher.

Thus, as noted above, one of the difficulties in working with unstructured data is that of building and creating knowledge based on the unstructured data. Put another way, one of the challenges with unstructured data pertains to disambiguating the data so that the data can be the subject of meaningful information processing techniques.

Some approaches that have been used in the past in an attempt to disambiguate unstructured data utilize so-called knowledge architects. Knowledge architects are typically very highly skilled professionals who craft knowledge based on the data. The techniques and approaches that these individuals use tend to be very expensive, owing to the highly-skilled nature of the individual(s) architecting the system. Additionally, the specific systems that are put in place by such individuals do not tend to be easily repeatable in different scenarios or environments. Thus, these approaches tend to be expensive and highly specifically directed to a particular problem at hand. As such, there remains a need, in the area of data disambiguation, for systems that are less complex insofar as implementation and deployment are concerned. In addition, there is a need for such systems that do not require a highly specialized professional to set up and deploy the system.

SUMMARY

Various embodiments provide a state-based, regular expression parser in which data, such as generally unstructured text, is received into the system and undergoes a tokenization process which permits structure to be imparted to the data. Tokenization of the data effectively enables various patterns in the data to be identified. In some embodiments, one or more components can utilize stimulus/response paradigms to recognize and react to patterns in the data.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram that illustrates components of a system in accordance with one embodiment.

FIG. 2 is a block diagram that illustrates components that can be used for conducting lexical analysis in accordance with one embodiment.

FIG. 3 is a block diagram that illustrates software components in a system in accordance with one embodiment.

FIG. 4 is a block diagram that illustrates a system in accordance with one embodiment.

FIG. 5 is a block diagram that illustrates a system in accordance with one embodiment.

FIG. 5a is a block diagram that illustrates steps in a method in accordance with one embodiment.

FIG. 6 is a block diagram that illustrates a system in accordance with one embodiment.

DETAILED DESCRIPTION

Overview

As an overview, various embodiments described in this document utilize a state-based, regular expression parser that is designed to deal with language, text and text types. In accordance with at least one embodiment, data such as text is received into the system and undergoes a tokenization process which permits structure to be imparted to the data. As the data undergoes the tokenization process, portions of the data (e.g. individual words of text) are assigned different types. As an elementary example, consider that individual words can be considered as parts of speech, such as nouns, verbs, prepositions and the like. Thus, a very elementary system might be set up to tokenize individual words according to their respective part of speech. Perhaps a better example is to consider that different nouns can be tokenized as types of nouns, e.g. places, dates, email addresses, web sites, and the like.

Tokenizing data creates patterns in the language which, in turn, can allow simple keyword searches or searching for different types of objects such as date objects, place objects, email address objects and the like. The tokenization process is effectively a generalized abstraction process in which typing is used to abstract classes of words into different contexts that can be used for much broader purposes, as will become apparent below.

FIG. 1 shows, generally at 100, an exemplary software architecture or system that can be utilized to implement various embodiments described above and below. The software architecture can be embodied on any suitable computer-readable media.

In this particular example, system 100 comprises a functional presence engine 102, one or more knowledge bases 104 and, optionally, an information retrieval module 106. In accordance with one embodiment, system 100 receives unstructured data and processes it in a manner that imparts a degree of useful structure to it. The output of system 100 can be structured data, one or more actions, or both, as will become apparent below. Each of these individual components is discussed in more detail below under its own respective heading.

Functional Presence Engine

In accordance with one embodiment, the functional presence engine 102 is implemented as a probabilistic parser that performs lexical analysis, using lexical archetypes, to define recognizable patterns. The functional presence engine can then use one or more stimulus/response knowledge bases, such as knowledge bases 104, to make sense of the patterns and react to them appropriately. In accordance with one embodiment, system 100 can learn or be trained by changing the lexical archetypes, the knowledge bases, or both.

Lexical Analysis

The discussion below provides but one exemplary implementation of how lexical analysis can be performed in accordance with the described embodiment. It is to be appreciated and understood that the description provided below is not intended to limit application of the claimed subject matter. Rather, other approaches can be utilized without departing from the spirit and scope of the claimed subject matter.

In accordance with one embodiment, lexical analysis is performed utilizing a system, such as the system shown generally at 200 in FIG. 2. System 200 comprises, in this embodiment, an external .lex file 202 which specifies a series of rules and their output symbols, a program 204 to read the .lex file and convert it into a program which, in this example, comprises a C++ lex-program, a lexical analysis program 206 which, when provided with data such as text, produces tokenized content in the format specified in the .lex file, and an independent regular expression library 208.

The .lex File

In accordance with one embodiment, the .lex file 202 comprises a structure having two component parts: a macro section and one or more lex sections. In the illustrated and described embodiment, the .lex file is case sensitive, as are the regular expressions embodied by it. The macro section specifies symbol rewrites. The macro section is used to create named identifiers representing more complicated regular expression patterns. This allows the author to create and re-use regular expressions without having to rewrite the same patterns in more than one place. Macros keep the lex sections cleaner and allow common expressions to be changed in only one place. As an example, consider the following.

%macro
regular_expression → macro-name
regular_expression1 → macro-name1

This is a valid example macros section.

%macros // begin macros
\t\n\f\r,'- → wb // macros!
\!\?\:\;\." → sb // more macros

With respect to the lex section, consider the following:

%lex optional_name
regular_expression1 → output_specifier[,output_specifier...]
regular_expression2 → output_specifier[,output_specifier...]
regular_expression3 → output_specifier[,output_specifier...]
optional_name → output_specifier[,output_specifier...]

“%lex” denotes the beginning of a section of lexical rewrite rules. In some cases it is desirable to specify a name. This is explored in more detail below. On the lines following the “%lex” tag, a series of rules are specified. These rules specify a regular expression followed by a series of output symbols. As an example, consider the following:

([[:alpha:]]+)[:wb:]+ → WORD{1}

The left hand side of this expression is a regular expression. In this example, notice a “:wb:” on the left hand side which specifies a macro. Macros are specified using the format “:macro-name:”. A preprocessor will substitute the macro value wherever it finds a macro name surrounded by colons. A special case construct is when the rule expression matches the name specified in the “%lex” tag. This is a pass through rule, meaning that if no other rule matches, this default rule will consume the entire text and call the output specifiers with the entire text. There are some cases where this is useful, such as when the %lex program will never be a top level program. In accordance with one embodiment, a known regular expression engine is utilized, namely the public domain engine PCRE 3.9, which will be known by the skilled artisan.
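The macro substitution step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `expandMacros` and the use of a `std::map` as the macro table are hypothetical and are not part of the described preprocessor. The sketch substitutes only names that appear in the macro table, so that POSIX character classes such as `[[:alpha:]]` are left alone.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: replace every ":name:" reference in a rule pattern
// with the value of the macro called "name". Only names present in the
// table are substituted, so "[[:alpha:]]" is untouched unless a macro
// named "alpha" has been defined.
std::string expandMacros(std::string pattern,
                         const std::map<std::string, std::string>& macros) {
    for (const auto& entry : macros) {
        const std::string ref = ":" + entry.first + ":";
        std::size_t at = 0;
        while ((at = pattern.find(ref, at)) != std::string::npos) {
            pattern.replace(at, ref.size(), entry.second);
            at += entry.second.size();  // continue after the inserted text
        }
    }
    return pattern;
}
```

With a table that maps `wb` to a space and a tab escape, the rule pattern `([[:alpha:]]+)[:wb:]+` would expand to `([[:alpha:]]+)[ \t]+` before being handed to the regular expression engine.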

Continuing, after the regular expression appears a “→” followed by a series of output specifiers. In the above example, a match of the given regular expression produces the output symbol “WORD” and the output text {1}. The brackets and numeric identifier are optional. These specify which sub-expression is output with the symbol. In the illustrated and described embodiment, sub-expressions are the text which matches regular expressions within parentheses. In this example, the text which matches ([[:alpha:]]+) would be output along with the token “WORD”. If the above example were changed to:

([[:alpha:]]+)[:wb:]+ → WORD

then the output token would be the same, but the entire match would be returned as the text. This is the same as writing “WORD{0}”. As another example, consider the following:

([[:alpha:]]+)([:sb:]+) → WORD{1}, EOS{2} // WORD and EOS

The example pattern above matches alpha characters, followed by the macro :sb:, which is defined in our example to be sentence boundary tokens. When text followed by a period occurs, two tokens are output: the WORD token and an end of sentence (EOS) token. This demonstrates how a single match can produce more than one token. There is no limit on the number of tokens which can be output, except as guided by practicality. As another example, consider the pattern appearing just below:

[^:wb:]+ → GWORD

This pattern looks for any character that is not a word boundary character and outputs a GWORD token, and the output text is the entire match.
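As a rough illustration of how output specifiers such as WORD{1}, EOS{2} can turn one match into several tokens, consider the sketch below. It uses the standard `std::regex` library rather than the PCRE engine named above, and the types `OutputSpec` and `Token` and the function `applyRule` are hypothetical names introduced only for this example.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical types: one output specifier such as WORD{1}, and one token.
struct OutputSpec { std::string symbol; std::size_t group; };
struct Token { std::string symbol; std::string text; };

// Try a single rule at the start of `text`. On success, append one token
// per output specifier and return the number of characters consumed.
// Returns 0 when the rule does not match at the start of the text.
std::size_t applyRule(const std::regex& re, const std::vector<OutputSpec>& outs,
                      const std::string& text, std::vector<Token>& tokens) {
    std::smatch m;
    if (!std::regex_search(text, m, re, std::regex_constants::match_continuous))
        return 0;
    for (const auto& o : outs)
        tokens.push_back({o.symbol, m[o.group].str()});  // group 0 = whole match
    return static_cast<std::size_t>(m.length(0));
}
```

Applied to the text "done. Next" with the pattern `([[:alpha:]]+)([!?:;.]+)` and the specifiers WORD{1}, EOS{2}, the sketch appends a WORD token carrying "done" and an EOS token carrying ".".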

Putting the entire lex construct together, consider the following:

%lex main
([[:alpha:]]+)[:wb:]+ → WORD{1}
([[:alpha:]]+)([:sb:]+) → WORD{1}, EOS{2} // WORD and EOS
[^:wb:]+ → GWORD // generic graphic word

In this particular example, when the lexer runs, it chooses the rule which matches the most text as the rule which will trigger the output token. Options may be added later to control this behavior. This lexer will output text words, end of sentence markers, and graphic words.

For handling large volumes of text, it is important to keep the main lexer simple. That said, in some scenarios, it can be desirable to tokenize things such as EMAIL, MONEY, IP addresses and URLs. The following simple rules are provided as an example of rules that tokenize such things.

([a-zA-Z0-9._-]+)@(([a-zA-Z0-9._-]+\.)+[a-zA-Z0-9._-]{2,3}) → EMAIL
[$]([\d]+\.[\d]*) → MONEY
[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3} → IP
((http|https)://)?([a-zA-Z0-9._-]+\.)+[a-zA-Z0-9]{1,3} → URL

To address efficiency and performance issues, the format utilized for lexical analysis can add some additional constructs. Recall from above that the file can specify one or more %lex constructs. This being the case, consider the following.

Instead of putting the four rules listed above into the “main” lexer, the rules can instead be added to a sub-lexer as follows:

%lex GWORD
([a-zA-Z0-9._-]+)@(([a-zA-Z0-9._-]+\.)+[a-zA-Z0-9]{2,3}) → EMAIL
[$]([\d]+\.[\d]*) → MONEY
[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3} → IP
((http|https)://)?([a-zA-Z0-9._-]+\.)+[a-zA-Z0-9]{1,3} → URL

Using this format, the entire file would look as follows:

%macros // begin macros
\t\n\f\r ,'- -> wb // macros!
\!\?\:\;\.\" -> sb // more macros

%lex main
([[:alpha:]]+)[:wb:]+ → WORD{1}
([[:alpha:]]+)([:sb:]+) → WORD{1}, EOS{2} // WORD and EOS
[^:wb:]+ → GWORD{0} // generic graphic word

%lex GWORD
([a-zA-Z0-9._-]+)@(([a-zA-Z0-9._-]+\.)+[a-zA-Z0-9]{2,3}) → EMAIL
[$]([\d]+\.[\d]*) → MONEY
[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3} → IP
((http|https)://)?([a-zA-Z0-9._-]+\.)+[a-zA-Z0-9]{1,3} → URL

In this file, there are two lex programs. Generally, the “main” lex is the only lexer executed at the top level of the text tokenization process. The rules under the %lex GWORD in general will not execute. However, make note of the fact that the rule “[^:wb:]+ → GWORD{0}” has the output token of GWORD and note that the new %lex construct has the name GWORD. This specifies a recursive lex procedure. When GWORD is the matched construct from “main”, that is, no other rule matches more text, before outputting GWORD, it will first try to match all the lexical rules under the %lex GWORD tag. This is analogous to a procedure call in a programming language. The data that gets passed in is the text specified in the output, in our case GWORD{0}, the entire matched text. From a performance standpoint, there is only a performance hit when we find special graphic words. For alpha-only words, the GWORD lexer will not run.
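The recursive call into the GWORD sub-lexer can be sketched as below. This is a minimal illustration only: it uses `std::regex` instead of the PCRE engine, writes `[0-9]` in place of `[\d]`, assumes that a sub-lexer rule must match the entire text passed in, and tries the rules in file order. The function name `refineGword` is hypothetical.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: given the text of a GWORD produced by the main
// lexer, try the rules of the GWORD sub-lexer. If one of them matches the
// whole text, its token name replaces GWORD; otherwise GWORD is kept.
std::string refineGword(const std::string& text) {
    static const std::vector<std::pair<std::regex, std::string>> subRules = {
        {std::regex(R"(([a-zA-Z0-9._-]+)@(([a-zA-Z0-9._-]+\.)+[a-zA-Z0-9]{2,3}))"), "EMAIL"},
        {std::regex(R"([$]([0-9]+\.[0-9]*))"), "MONEY"},
        {std::regex(R"([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})"), "IP"},
        {std::regex(R"(((http|https)://)?([a-zA-Z0-9._-]+\.)+[a-zA-Z0-9]{1,3})"), "URL"},
    };
    for (const auto& rule : subRules) {
        if (std::regex_match(text, rule.first)) return rule.second;
    }
    return "GWORD";  // no specialized rule applied
}
```

Because the sub-lexer is only consulted for text that the main lexer has already classified as a graphic word, plain alphabetic words never pay for these four additional pattern matches.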

In addition to the constructs described above, in at least one embodiment, other constructs can be utilized. These constructs can control which lexers lexically process the data first. As an example, consider the construct “%push lexer-name, %pop”. In accordance with one embodiment, the lexer program can maintain a stack of lexers. Lexers which are on the execution stack are evaluated by a top level parser. Lexers which are not on the execution stack are not evaluated unless recursive tokenization occurs. It is possible to push new lexers onto the stack in order to read specific data and then pop them when finished, as will be appreciated by the skilled artisan.

To demonstrate this, the code listing below is a text representation of the actual .lex file parser in the lex language. The main lexer is either the %lex named “main” or the first %lex encountered in the file. In the present example, both conditions are satisfied by the first-encountered lexer. In the illustrated and described example, the %lex main program looks for “%macros” specifiers or “%lex” specifiers, comments, or extra white space.

When a %macros tag is encountered, it emits a symbol “MACRO”, then pops anything on the stack, and then pushes the %lex READ_MACRO program onto the top of the stack. The rules in %lex READ_MACRO will now get the first chance to evaluate the incoming data or text. If READ_MACRO fails to match, then %lex main will also have an opportunity to evaluate the incoming data or text.
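The evaluation order implied by the lexer stack can be sketched as follows. The sketch makes several assumptions that are not stated in the text: the stack is modeled as a vector of lexer names with the main lexer at the bottom, each lexer is reduced to a list of `std::regex` patterns, and the name `firstMatchingLexer` is hypothetical.

```cpp
#include <map>
#include <regex>
#include <string>
#include <vector>

using Rules = std::vector<std::regex>;

// Hypothetical sketch: walk the execution stack from the top down and
// return the name of the first lexer that has a rule matching at the
// start of `text`. An empty string means no lexer on the stack matched.
std::string firstMatchingLexer(const std::vector<std::string>& stack,
                               const std::map<std::string, Rules>& lexers,
                               const std::string& text) {
    for (auto it = stack.rbegin(); it != stack.rend(); ++it) {
        for (const auto& re : lexers.at(*it)) {
            if (std::regex_search(text, re,
                                  std::regex_constants::match_continuous)) {
                return *it;
            }
        }
    }
    return "";
}
```

In this model, pushing READ_MACRO on top of main gives the READ_MACRO rules the first chance at the input, while main still sees any input that READ_MACRO cannot match.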

When a %lex tag is encountered, the same process occurs, except the top program becomes READ_LEX. READ_LEX looks for rules and, if encountered, it tokenizes the REGEX of the rule, and then pushes READ_LEX_RULE to read the right hand side of the rule. This demonstrates the recursive capabilities of the system. The program, on certain input conditions, triggers a state change to a specialized sub lexer which is capable of parsing a specific type of data. The sub lexer will process the data and then perform a %pop operation when the sub lexer has completed its task.

If READ_LEX_RULE encounters some non-white, non-comment text, it gathers it, and calls the LEX_TOKEN program with the gathered text. LEX_TOKEN looks for %push, %pop, xxx{digit}, or xxx. In the illustrated and described embodiment, LEX_TOKEN is not on the stack though. Rather, it is a sub-component that is executed based on the text gathered by the parent, as described above.

Consider now the additional construct “%lex default”, which is a program that is specified at the bottom of the code listing below. In accordance with one embodiment, these constructs will only execute if the text cannot be tokenized using the execution stack. In the present example, this program is utilized to indicate a syntax error.

// This program is the lex specification for a lexer that
// reads the "lex" file.

// main lexer
%lex main
\%macros[ \t]* -> MACRO, %pop, %push READ_MACRO
\%lex[ \t]*([A-Za-z0-9_]*)[ \t]* -> LEX{1}, %pop, %push READ_LEX
//.* -> // ignore
[ \t\r\n]+ -> // ignore

// read macro lexer
%lex READ_MACRO
[ \t]*(.+)[ \t]*-\>+[ \t]*([A-Za-z0-9_]*) -> MACRONAME{2}, MACROVALUE{1}

// read lex rules
%lex READ_LEX
[ \t]*(.+)[ \t]*-\> -> RULE_RE{1}, %push READ_LEX_RULE

// read a single lex rule
%lex READ_LEX_RULE
[ \r\n] -> %pop // done reading a rule
[ \t]*//.* -> %pop
[ \t[:alnum:]_\{%\}]+ -> LEX_TOKEN{0} // recursive descent into
                                      // %lex LEX_TOKEN with the string
, -> // ignore

// read a single lex token
%lex LEX_TOKEN
// lex_token program
// this is called to perform a subrecognition on lex token output forms
// it is not in the top level parser stack
[ \t]*%push[ \t]+(.*)[ \t]* -> STACK_PUSH{1}
[ \t]*%pop[ \t]*$ -> STACK_POP
[ \t]*([A-Za-z0-9_]+)\{[ \t]*([0-9]+)[ \t]*\}[ \t]* -> TOKEN{1}, TOKENPARAM{2}
[ \t]*([A-Za-z0-9_]+)[ \t]*$ -> TOKEN{1}

%lex default
// default is ONLY hit when no other lexers on the stack evaluate
// in this case we want to spit a syntax error
// at this point, any text is considered a syntax error
.* -> SYNTAX_ERROR

How .lex Matches

In accordance with one embodiment, under a given .lex program, the default methodology is to attempt to match all the regular expressions in the %lex group and choose the rule which consumes the most input. In accordance with one embodiment, however, a program directive “%pragma” can be utilized to specify behaviors for the analysis system. For example, a %pragma firstmatch before the %lex tag indicates that the matching behavior should be to choose the first rule which successfully matches the incoming text. This can improve performance, but it can also significantly change which rule fires, because a later rule that would have matched more text is never considered.

The syntax of this program directive is: %pragma pragma_name. The following pragmas (case-sensitive) are currently defined:

Pragma: Definition

%pragma firstmatch: The %lex tags and rules below this %pragma are matched using the “first match” strategy. That is, the first rule which is able to match the incoming text is the rule which will fire; other rules are ignored. This is for performance. It is not the default behavior.

%pragma bestmatch: The %lex tags and rules below this %pragma are matched using the “best match” strategy. All the rules of a particular %lex group are used. The rule which matches the longest string will be the rule which is fired. This is the default behavior.
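The difference between the two strategies can be sketched as follows. The sketch is an assumption-laden illustration rather than the described implementation: it uses `std::regex`, the names `Strategy` and `selectRule` are hypothetical, ties under the best match strategy are resolved in favor of the earlier rule, and zero-length matches are never selected under that strategy.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

enum class Strategy { BestMatch, FirstMatch };

// Hypothetical sketch: return the index of the rule that fires at the
// start of `text`, or -1 if no rule matches there.
int selectRule(const std::vector<std::regex>& rules, const std::string& text,
               Strategy strategy) {
    int winner = -1;
    std::ptrdiff_t longest = 0;
    for (std::size_t i = 0; i < rules.size(); ++i) {
        std::smatch m;
        if (!std::regex_search(text, m, rules[i],
                               std::regex_constants::match_continuous)) {
            continue;
        }
        if (strategy == Strategy::FirstMatch) {
            return static_cast<int>(i);  // stop at the first matching rule
        }
        if (m.length(0) > longest) {     // keep the rule consuming the most
            longest = m.length(0);
            winner = static_cast<int>(i);
        }
    }
    return winner;
}
```

For the input "bob@example.com rest" and the two rules `[[:alpha:]]+` and `[^ \t]+`, the first match strategy selects the first rule (which consumes only "bob"), while the best match strategy selects the second rule (which consumes the whole address).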

Implementation Details

The following discussion is provided to describe one particular implementation example of the system shown in FIG. 2. This example is not intended to limit application of the claimed subject matter to the specifically described example. Rather, this example is provided as a guide to the skilled artisan as but one way certain aspects of the described embodiments can be implemented.

First, in accordance with one embodiment, to utilize the lexer or the regular expression engine singularly, the user should consider the following classes, each of which is discussed below under its own separate heading:

-   CRegex: regular expression engine. This class allows the user to set the regular expression and then search a string of data for the expression.
-   lex_program: a C++ implementation of the features provided in the .lex file format.
-   lex_program_compiler: a compiler that produces a lex_program from a .lex stream.
-   lextoken: output data from the lex_program. An individual token of data.

CRegex

In accordance with one embodiment, this class is a self-contained class for matching strings of text to a regular expression. Like all C++ classes in the Lex library, CRegex supports value class semantics, assignment, and copy construction. All of these operations are valid and tested.

In accordance with one embodiment, the member methods in this class include a compile method, a match method, a getMatches method, and a GetLastError method, each of which is described below.

CRegex::compile

bool compile(const char* szRE, int flags, long* pFailureOffset=NULL);

This method compiles the specified regular expression, given in szRE, using the specified flags.

Parameters

szRE: [in] pointer to a perl5 compatible regular expression

flags: [in] modifier flags for compilation

lexer::anchored: The pattern is forced to be “anchored”. That is, the pattern is constrained to match only at the start of the string which is being searched (the “subject string”). This effect can also be achieved by appropriate constructs in the pattern itself.

lexer::caseless: Letters in the pattern match both upper and lower case letters.

lexer::dollarend: A dollar metacharacter in the pattern matches only at the end of the subject string. Without this option, a dollar also matches immediately before the final character if it is a new line (but not before any other new lines). This option is ignored if lexer::multiline is set.

lexer::multiline: By default, CRegex treats the subject string as consisting of a single “line” of characters (even if it actually contains several new lines). The “start of line” metacharacter (^) matches only at the start of the string, while the “end of line” metacharacter ($) matches only at the end of the string, or before a terminating new line (unless lexer::dollarend is set).

pFailureOffset: [in, out] a pointer to an integer variable that will receive the offset at which the string failed to compile. This may be useful for custom error handling. The value is undefined if the compilation succeeds.

Return Value

bool: true if the compilation was successful, false otherwise. Use CRegex::GetLastError( ) to retrieve a more detailed error message.

CRegex::match

long match(const char* szText, long inLen, int flags=0)

This method attempts to match the compiled regular expression stored in the class object with the specified text given the length of text. It returns the number of characters consumed by the match. It will return 0 if the match failed. As an example, if “this” is the text to match and the string is as follows, “blah this is fun”, then match will return 9, the position just past the end of the match.

To retrieve more detailed information about the match, the method CRegex::getMatches, described below, can be utilized after performing the match.

Parameters

szText: [in] data to match against

inLen: [in] number of characters in szText to analyze. Use −1 if szText is null terminated and you wish to match up to the end of string.

flags: [in]

lexer::notbol: The first character of the string is not the beginning of a line, so the circumflex metacharacter should not match before it. Setting this without lexer::multiline (at compile time) causes ^ never to match.

lexer::noteol: The end of the string is not the end of a line, so the dollar metacharacter should not match it nor (except in multi-line mode) a newline immediately before it. Setting this without lexer::multiline (at compile time) causes dollar never to match.

Notes

The return value is somewhat unintuitive. It returns the position of the next character after the end of the matched text. Just note that a non-zero value means there was a match. To get specific information about exactly where the match occurs, call CRegex::getMatches anytime after the call to CRegex::match.

Returns

long—number of characters consumed by the match process—0 if no match.
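The return convention described above (the offset just past the end of the match, or 0 when there is no match) can be reproduced with the standard library as a rough sketch. This is not the CRegex implementation; it uses `std::regex`, and the function name `matchEnd` is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical sketch of the described return convention: search `text`
// for `pattern` and return the offset one past the end of the first
// match, or 0 if the pattern does not occur.
long matchEnd(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, re)) {
        return 0;
    }
    return static_cast<long>(m.position(0) + m.length(0));
}
```

For the pattern "this" and the text "blah this is fun", the match starts at offset 5 and is 4 characters long, so the sketch returns 9. One consequence of this convention is that an empty match at offset 0 would be indistinguishable from a failed match.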

CRegex::getMatches

long getMatches(int** ppMatches);

Call this method after calling CRegex::match to get a pointer to the list of matches (submatches). It returns, in ppMatches, a pointer to the internal list of matches retrieved after the last match. It also returns the number of valid matches.

Each call to match only matches the regular expression once. Callers will need to iterate to find all the particular matches. The getMatches method returns positional information about where the match occurred in the text. The first two integers specify the start position and end position of the whole match. The next “n” integers return the positions of all submatches in the source string. As an example, consider the following:

RE: ([[:alpha:]]+)([\d]+)
Subject string: “Mark123 is here”
CRegex::match returns: 7, indicating success
getMatches returns 3. This list of integers looks like this:
[0] - 0
[1] - 7
[2] - 0
[3] - 4
[4] - 7

Parameters

ppMatches: [out] a live pointer to the matches. This pointer dies with the class or when the matcher is recompiled.

Returns

The number of matches: 1 (whole match) + number of submatches. 0 if there was no match in the last call to CRegex::match.

Notes

The returned match list is a class object which goes out of scope with the class, or when CRegex::compile is called.
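The positional information for the “Mark123 is here” example can be approximated with the standard library, as sketched below. The sketch assumes that the whole match and every submatch each contribute their own start/end pair (so a pattern with two submatches yields six integers here), that every group participates in the match, and that `[0-9]` stands in for `[\d]`. The function name `matchOffsets` is hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical sketch: return start/end offsets for the whole match
// followed by start/end offsets for each parenthesized submatch.
// An empty vector means the pattern did not match.
std::vector<int> matchOffsets(const std::string& pattern,
                              const std::string& subject) {
    const std::regex re(pattern);
    std::smatch m;
    std::vector<int> offsets;
    if (!std::regex_search(subject, m, re)) {
        return offsets;
    }
    for (std::size_t i = 0; i < m.size(); ++i) {
        offsets.push_back(static_cast<int>(m.position(i)));
        offsets.push_back(static_cast<int>(m.position(i) + m.length(i)));
    }
    return offsets;
}
```

Under these assumptions, the whole match spans offsets 0 to 7, the first submatch (“Mark”) spans 0 to 4, and the second submatch (“123”) spans 4 to 7.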

CRegex::GetLastError

std::string GetLastError( ) const;

This method returns the compilation error, if any. It returns a string that specifies where in the regular expression the compilation failed and is useful for debugging compilation errors.

lex_program

In accordance with the described embodiment, lex_program is the C++ lexical analyzer and is used to tokenize data sources. The lex_program can be created from scratch or compiled from a file using a lex_program_compiler. The member methods of this class include a lex_program method, a Tokenize method, a Begin method and a GetNextTokens method, each of which is described below.

lex_program::lex_program

lex_program::lex_program(ulong lexOptions=0)

Parameters

lexOptions: [in] Flags to control the behavior of the program.

lexer::opt_lineCounts: The lexer will manually keep track of character and line position. Lextokens returned from this program will contain valid charNum and lineNum fields. This should only be used when this information is important; otherwise it is not recommended, because there is a modest performance hit involved in keeping track of this information.

lexer::opt_firstMatch: This option instructs the lex program to prefer first matches (see “How .lex Matches” above). This is usually controlled by the input .lex file and the lexer::lexer class. It is recommended that this value be set in the lexer class and not here.

lex_program::Tokenize

virtual bool Tokenize(const spchar* pData, ulong length, std::vector<lextoken>& vcTokens, bool bResetState=true);

Given source text, this method tokenizes the source data and returns lextokens.

Parameters

pData: [in] pointer to the data to be tokenized

length: [in] length of data to be tokenized

vcTokens: [out] list of tokens generated by the content. Tokens are appended to the end of it.

Remarks

This method tokenizes the entire content. It is an alternative or simplification to calling lex_program::Begin( ), and then lex_program::GetNextTokens( ) iteratively until it returns false.

lex_program::Begin

virtual bool Begin(const spchar* pData, ulong length, bool bReset=true);

This method is used to set the source data for the lexical analysis. Call this before calling lex_program::GetNextTokens. It is not necessary to call this method if using lex_program::Tokenize.

Parameters

pData: [in] pointer to the data to be tokenized

length: [in] length of data to be tokenized

bReset: [in] reset the stack and variable state of the lexer back to default.

Returns

true

lex_program::GetNextTokens

virtual bool GetNextTokens(std::vector<lextoken>& toks);

This method is used to retrieve the next token from the input stream. It may return more than one token. Use this function to iteratively run through the data in as atomic a way as possible. This method returns true until the end of data is reached. It is possible that the returned token list is empty even if the return value is true.

Parameters

toks: [out] vector of tokens. New tokens are appended onto this list and the list is NOT CLEARED by this method; users must clear the list manually if this is the desired effect.

Returns

false when the entire string has been tokenized, to the best of the program's ability.

Remarks

lex_program::Begin( ) must be called before calling GetNextTokens.
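The calling pattern for Begin, GetNextTokens and Tokenize can be sketched with a small stand-in class. `ToyLexer` and its `Token` type are hypothetical and only mimic the shape of the interface described above. The toy treats each run of non-whitespace characters as a GWORD and silently consumes whitespace, which also shows how a call can return true without appending a token.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct Token { std::string type; std::size_t pos; std::size_t length; };

// Hypothetical stand-in for the described interface (not lex_program).
class ToyLexer {
public:
    // Set the source data; must be called before GetNextTokens.
    bool Begin(const char* data, std::size_t length) {
        data_ = data; len_ = length; pos_ = 0;
        return true;
    }
    // Consume one run of input. Appends to `toks` without clearing it.
    // Returns false once the whole input has been consumed.
    bool GetNextTokens(std::vector<Token>& toks) {
        if (pos_ >= len_) return false;
        const std::size_t start = pos_;
        const bool space = isSpace(data_[pos_]);
        while (pos_ < len_ && isSpace(data_[pos_]) == space) ++pos_;
        if (!space) toks.push_back({"GWORD", start, pos_ - start});
        return true;  // may be true even though nothing was appended
    }
    // Convenience wrapper: Begin, then iterate until the input is exhausted.
    bool Tokenize(const char* data, std::size_t length, std::vector<Token>& toks) {
        Begin(data, length);
        while (GetNextTokens(toks)) {}
        return true;
    }
private:
    static bool isSpace(char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    }
    const char* data_ = nullptr;
    std::size_t len_ = 0;
    std::size_t pos_ = 0;
};
```

Stepping through "ab  cd" with GetNextTokens yields one token on the first call, none on the second (the whitespace run), one on the third, and false on the fourth. Tokenize produces the same two tokens in a single call.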

lex_program_compiler

The class lex_program_compiler is a class that converts a stream of text in .lex format (described above) into a lex_program which can be used for lexical analysis. The member methods in this class include a compile_lex method described below.

lex_program_compiler::compile_lex

bool compile_lex(const char* pData, long nDataLength, lex_program& lexProgram, std::vector<lexfileerror_t>& errors);

Given a pointer to .lex formatted data and a data length, returns an instantiated lex program capable of tokenizing streams as specified in the pData.

Parameters

pData: [in] pointer to the data. Use NextIt::LoadDiskFileIntoString( ... ) or some other disk file loading method to load the lex file into memory.

nDataLength: [in] number of .lex formatted bytes contained in the pData pointer

lexProgram: [out] compiled program

errors: descriptive list of errors, if any.

Returns

bool: true if the compile succeeded without errors or warnings. The application is responsible for determining if errors or warnings warrant a stoppage. Recommended: stop and display errors.

lextoken

This data class is the return class of the lex_program and represents a token. It is designed for efficient parsing. In addition to returning a token constant, it also returns positional information and length information of the source text that produced the token, which is important for language processing.

Member data

lexfilepos_t startPos

typedef struct {
    ulong lineNum; // 0 based line number
    ulong charNum; // 0 based character index
    ulong pos;     // absolute position in the buffer
} lexfilepos_t;

This is the starting information within the source data. If the lex_program was created with the lexer::opt_lineCounts option, the lexfilepos_t will also contain a valid character and line number. startPos.pos specifies the exact byte position in the source data.

long length;

This is the length in bytes of text representing this token.

An example usage of the file position information would be to create astring representing the exact characters captured by the token. Such as:std::string str(&pData[tok.startPos.pos], tok.length);ulong idToken

This is the unique identifier for this token. The id is a unique hash value defining the lexical token, or type, which the lexer has recognized. In this implementation, the hashing program is the system hasher used by many subsystems, WordHashG.
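The relationship between startPos.pos, length and the source buffer can be illustrated with a small model. The following is a minimal sketch in Python rather than the C++ of the actual class; the names FilePos, LexToken and token_text, the sample buffer, and the id value are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class FilePos:
    """Hypothetical stand-in for lexfilepos_t."""
    line_num: int  # 0 based line number
    char_num: int  # 0 based character index
    pos: int       # absolute byte position in the buffer


@dataclass
class LexToken:
    """Hypothetical stand-in for lextoken."""
    start_pos: FilePos
    length: int    # length in bytes of the text behind the token
    id_token: int  # hashed token type (value below is made up)


def token_text(data: bytes, tok: LexToken) -> bytes:
    """Recover the exact bytes captured by a token."""
    start = tok.start_pos.pos
    return data[start:start + tok.length]


data = b"visit http://example.com today"
tok = LexToken(FilePos(line_num=0, char_num=6, pos=6), length=18, id_token=0x1234)
print(token_text(data, tok))
```

Because the token carries only an offset and a length, the source text is never copied during lexing; a string is materialized only when the application asks for it.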

Knowledge Bases

As noted in FIG. 1, one of the components that is utilized by system 100 is a knowledge base component 104. In the illustrated and described embodiment, knowledge base component 104 is implemented, at least in part, utilizing one or more files that are defined in terms of a hierarchical, tag-based language which, in at least some embodiments, can be used to set up cases of text that match incoming data or text, and to define responses that are to be triggered in the event of a case match. In the illustrated and described embodiment, the tag-based language is referred to as “Functional Presence Markup Language” or FPML. Effectively, the FPML files are utilized to encode the knowledge that the system utilizes.

FPML

The discussion provided just below describes aspects of the FPML that are utilized by system 100 to implement various knowledge bases. It is to be appreciated and understood that this description is provided as but one example of how knowledge can be encoded and used by system 100. Accordingly, other techniques and paradigms can be utilized without departing from the spirit and scope of the claimed subject matter.

Preliminarily, FPML is an extensible markup language (XML)-based language that can be utilized to define a surface-level conversational, action-based, or information acquisition program. FPML can be characterized as a stateful expression parser with the expressiveness of a simple programming language. Some of the advantages of FPML include its simplicity, insofar as it enables ordinary technical people to capture and embody a collective body of knowledge. Further, FPML promotes extensibility in that deeper semantic forms can be embedded in the surface-level engine. In addition, using FPML promotes scalability in that the system can be designed to allow multiple robots to run on a single machine without significant performance degradation or inordinate memory requirements. In this regard, it should be noted that one application of the technology described in this document utilizes robots, more properly characterized as bots, to provide implementations that can be set up to automatically monitor and/or engage with a particular cyberspace environment such as a chat room or web page. The knowledge bases, through the FPML files, are effectively utilized to encode the knowledge that is utilized by the bots to interact with their environment.

As noted above, FPML allows a user to set up “cases” of language text that match incoming sentences and define responses to be triggered when the case matches. In accordance with various embodiments, cases can be exact string matches, or more commonly partial string matches, and more complicated forms. FPML also supports typed variables that can be used for any purpose, for example, to control which cases are allowed to fire at a given time, thereby establishing a “state” for the program. Typed variables can also be used to set and record information about conversations that take place, as well as configuration settings for one or more robots, as will become apparent below.

In accordance with one embodiment, any suitable types of variables can be supported, e.g. string variables, floating point variables, number variables, array variables, and date variables, to name just a few.

As noted above, FPML is a hierarchical tag-based language. The discussion provided just below describes various specific tags, their characteristics and how they can be used to encode knowledge. Each individual tag is discussed under its own associated heading.

fpml tag

The fpml tag is used as follows:

<fpml> ... </fpml>

The FPML object is the top level tag for any fpml program. It encloses or encapsulates all other tags found in a document or file. The fpml tag can contain the following tags: <unit>, <rem>, <situation>, <if>, <lexer> and <load>. It should be noted that <rem name=“variablename” value=“variableValue”> is used to specify initial variables for the XML. When an FPML file is loaded, any <rem> whose direct parent is <fpml> is evaluated. This mechanism is used to set up initial values for variables and is used often. As an example of the fpml tag, consider the following:

<fpml>
  <unit>
    <input>I like dogs</input>
    <response>I like dogs too, <acq name=“name”/>!</response>
  </unit>
</fpml>

This example fpml file has one case, which recognizes the string “I like dogs”, and responds with “I like dogs too” followed by the value of the variable “name”, which by convention is the name of the user.

load tag

The load tag is used as follows:

<load filename=“path to file”/>

This instruction directs an fpml interpreter to load the fpml file specified by “path to file”. This path may be fully qualified, or it may be a partial path relative to the FPML file in which the <load> tag appears. The load tag is contained in <fpml>. It does not contain other tags and should be written in open-closed form. As an example of the load tag, consider the following:

<!-- Load the fpml program defined in braindead.fpml -->
<load filename=“C:\fpml\braindead.fpml”/>
<load filename=“\files\LA010189-0003.fpml”/>
<load filename=“.\words.fpml”/>
<load filename=“words.fpml”/>

The first form loads a file from a fully qualified path. The second form loads the file from a subdirectory of the directory in which this file is located. The third form loads from the current directory, as does the fourth form.

lexer tag

The lexer tag is used as follows:

<lexer filename=“path-to-file”/>

This instruction directs the fpml interpreter to load and use the specified .lex file (described above) for breaking up incoming text into word tokens. This is important because even though fpml is a word-based parsing language, there is no absolute definition of what constitutes a word. The lexer program can also categorize words and surface this information to the fpml. This is discussed in more detail below in connection with the <input> tag reference. The lexer tag does not contain other tags and should be open-closed, and is contained in the <fpml> tag. As an example of the lexer tag, consider the following:

<lexer filename=“C:\fpml\words.lex”/>
<lexer filename=“\files\words.lex”/>
<lexer filename=“.\words.lex”/>
<lexer filename=“words.lex”/>

The first form loads from a fully qualified path. The second form loads from a subdirectory “files” relative to the directory in which the loading file lives. The third and fourth forms load the file located in the same directory in which the loading file lives.

unit tag

Use of the unit tag is as follows:

<unit> ... </unit>

The unit tag is a “case” in the system whose subtags identify the text that it matches and the response that should be taken in the presence of a match. The unit tag must contain the following tags: <input> and <response>, and can contain: <prev> and <prev_input>. The unit tag is contained in the tags: <fpml>, <if>, <cond> and <situation>.

The <input> tag is used to specify a text pattern to match. The optional <prev> and <prev_input> tags contain expressions that match previous dialog either from the user or from a robot. The <response> tag specifies the output when a match occurs. As an example of how this tag is used, consider the following:

<unit>
  <input>I like [.]</input>
  <response>I like <wild index=“1”/> too, <acq name=“name”/>!</response>
</unit>

This example fpml file has one case, which recognizes the string “I like [any single word]” and responds with “I like %incoming-word% too”, followed by the value of the variable “name”, which by convention is the name of the user.

input tag

Use of the input tag is as follows:

<input>text-input-expression</input>

The text contained within the input tag defines the words and expressions which will trigger the response encapsulated by the <response> tag. This tag contains text and no inner tags are evaluated. The input tag is contained in the unit tag. The “text-input-expression” contained within the <input> tag can have a special format, which can be characterized as a word-based regular expression. As an example of how this tag can be utilized, consider the following:

<input>I like dogs</input>

This matches the sentence “I like dogs” and nothing else, from the incoming text. Consider now the following use of this tag:

<input>I like +</input> <input>I like [+]</input>

This matches a sentence which begins with “I like” and is followed by one or more words. Additionally, consider the following example:

<input>I like *</input> <input>I like [*]</input>

This matches a sentence which begins with “I like” and is followed by zero or more words. It matches both “I like” and “I like you over there”. Further, consider the following example:

<input>I like [.]</input>

This matches a sentence which begins with “I like” and is followed by any single word. For example, it matches “I like you”, but not “I like the pickles” or “I like”. Thus, the expression [.] matches a single word. Consider the following examples:

<input>* I like *</input>
<input>* I like +</input>
<input>+ I like +</input>
<input>* I like [.]</input>

As indicated above, input expressions can contain more than one wildcard of any kind anywhere, as long as the wildcards are separated by at least one space from the literals.
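The wildcard rules described above can be sketched as a small word-level matcher. This is a minimal illustration in Python, not the FPML interpreter itself: the function name match_words is hypothetical, words are split on whitespace instead of by a .lex program, and only the *, +, [*], [+] and [.] forms are handled.

```python
def match_words(pattern: str, sentence: str):
    """Return the list of wildcard captures, or None when there is no match."""
    pat = pattern.split()
    words = sentence.split()

    def bounds(tok):
        # (min_words, max_words) a wildcard may absorb; None for a literal.
        if tok in ("*", "[*]"):
            return 0, len(words)
        if tok in ("+", "[+]"):
            return 1, len(words)
        if tok == "[.]":
            return 1, 1
        return None

    def walk(pi, wi, caps):
        if pi == len(pat):
            # The whole sentence must be consumed for a match.
            return caps if wi == len(words) else None
        limits = bounds(pat[pi])
        if limits is None:
            # Literal word: compare case-insensitively and advance.
            if wi < len(words) and words[wi].lower() == pat[pi].lower():
                return walk(pi + 1, wi + 1, caps)
            return None
        lo, hi = limits
        # Try every span length the wildcard allows, shortest first.
        for n in range(lo, min(hi, len(words) - wi) + 1):
            found = walk(pi + 1, wi + n, caps + [" ".join(words[wi:wi + n])])
            if found is not None:
                return found
        return None

    return walk(0, 0, [])


print(match_words("I like *", "I like you over there"))
print(match_words("I like [.]", "I like the pickles"))
```

The captures are returned in pattern order, which is the order a 1 based wildcard index in a response would refer to.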

The input tag can also utilize embedded expressions. Embedded expressions are bracketed with ‘[’ and ‘]’. These bracketed expressions are called queried-wildcards and are used to add expressiveness to the input language. The format of this construct is as follows:

[match-expression from_expression where_expression]

The following examples illustrate match expression syntax:

[ANY(word1, word2, word3, . . . ) from ‘wildcard’] (where wildcard is *, +, or .)
[ANY(word1, word2, . . . )] (the wildcard ‘+’ is implied)
[ANY(w1, w2) AND NOT ANY(w3, w4 . . . ) from +|*|.]
[VAR(bot_name) from +]

The function ANY(word1, word2, word3, . . . ) matches any of the specified words, e.g. <input>[ANY(books, magazines, pictures) from +]</input> matches “books”, “magazines” and “pictures”. The function ALL(word1, word2 . . . ) matches all of the specified words. The function VAR(variableName) matches the incoming string against a variable name, e.g. <input>[VAR(bot_name) from .] +</input> recognizes the bot name from the beginning of the sentence.

Consider also the function REGEX(perl5regularexpression):

<input>[REGEX(\$[\d]+(\.[\d]*)?) from +]</input>

This function matches money. The regular expression operates on each word subsumed by the wildcard, looking for a match.
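The per-word behavior of REGEX can be sketched as follows. This is an illustration only: it uses Python's re module rather than a Perl 5 engine (the pattern shown is valid in both), it splits words on whitespace, and it assumes that a word qualifies when the expression is found anywhere in it. The helper name words_matching is hypothetical.

```python
import re

# The money expression from the example above.
MONEY = re.compile(r"\$[\d]+(\.[\d]*)?")


def words_matching(regex, sentence):
    """Return the words of the sentence in which the expression is found."""
    return [word for word in sentence.split() if regex.search(word)]


print(words_matching(MONEY, "it costs $12.50 or 12 euros"))
```

Applying the expression word by word, instead of to the whole sentence, keeps the result aligned with the word positions that the wildcard covers.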

Various operators can be utilized within the input tag, including the NOT, AND, OR, “ANY(w1, w2) AND NOT ANY(w3, w4)”, and “VAR(bot_name)” operators.

The operator from_expression can be used and is optional. It specifies the wildcard of the queried-expression:

from + // one or more
from . // exactly one

If the from-expression is not specified, it is assumed to be the ‘+’ wildcard.

The operator where-expression is used to constrain the match further. Currently this is used to constrain a match to a given lexical token type. For instance, if an application is looking for e-mails, it could create a pattern that accepts only e-mail types, as created by the lexer:

[. WHERE TYPE==EMAIL]

This queried-wildcard expression would match any word whose type is EMAIL.

The lexer, in addition to splitting words and sentences, also produces tokens, which are characterizations of the graphic word. A lexical analyzer can, for example, recognize URLs, IP addresses, Dollars, and the like, as noted above. This information is available to the pattern matcher and can be used to match “types” of data. Consider the following example:

<unit>
  <input>* [where TYPE==URL] *</input>
  <response>URL: <wild index=“2”/></response>
</unit>

This unit matches any sentence containing a URL. In this example, the response is to simply provide the URL back to the user. A more complicated example can look for a particular URL. As an example, consider the following:

<unit>
  <input>* [REGEX(spectreai) from . where TYPE==URL] *</input>
  <response>URL: <wild index=“2”/></response>
</unit>

This unit looks for URLs containing the string “spectreai” anywhere in them.

In an implementation example, matching can proceed in a case-insensitive way. For a given sentence, all the <unit>s are given a chance to fire (assuming an <if> or <cond> does not prevent this). Given this, it is likely that there may be more than one match for a given string. For example:

<unit>
  <input>*</input>
  <response>I don't understand you</response>
</unit>
<unit>
  <input>+ what is your name +</input>
  <response>My name is <acq name=“bot_name”/></response>
</unit>

Suppose an incoming sentence is “Hey, what is your name dude?”. Both of these patterns actually match this string. Desirably, however, one wants the second pattern to evaluate. Given that the matcher is probabilistic, the second match, the one which recognizes the most known text, is chosen. The general idea is that the end-user should not have to worry about this. Picking the best match is the responsibility of the fpml interpreter. In the event of identical patterns, or identical probabilistic matches, the match that is loaded last wins. Consider the following example:

<unit>
  <input>+ what is your name +</input>
  <response>My name is <acq name=“bot_name”/></response>
</unit>
<unit>
  <input>+ what is your name +</input>
  <response>Who cares!</response>
</unit>

They both match the same text with the same probability. However, as the second match was the last loaded, the second will fire.
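The selection rule described above (prefer the unit that recognizes the most known text, and on a tie prefer the unit loaded last) can be sketched as follows. The scoring formula here is an assumption made for illustration: the score is the fraction of incoming words accounted for by literal words in the pattern. The names compile_pattern and best_response are hypothetical, and words are extracted with a simple regular expression rather than a .lex program.

```python
import re

# Each wildcard becomes a regex fragment over " word" units.
WILDCARDS = {"*": r"(?: \S+)*", "+": r"(?: \S+)+", "[.]": r" \S+"}


def compile_pattern(pattern):
    """Translate a word-based input expression into a compiled regex."""
    parts = [WILDCARDS.get(tok) or re.escape(" " + tok.lower())
             for tok in pattern.split()]
    return re.compile("".join(parts))


def best_response(units, sentence):
    """units is a list of (input pattern, response) pairs in load order."""
    text = "".join(" " + word for word in re.findall(r"\w+", sentence.lower()))
    best = None
    for order, (pattern, response) in enumerate(units):
        if compile_pattern(pattern).fullmatch(text):
            known = sum(tok not in WILDCARDS for tok in pattern.split())
            score = known / max(len(text.split()), 1)  # assumed scoring rule
            # ">=" lets a later-loaded unit win an exact tie.
            if best is None or (score, order) >= best[0]:
                best = ((score, order), response)
    return best[1] if best else None


units = [
    ("*", "I don't understand you"),
    ("+ what is your name +", "My name is Bot"),
]
print(best_response(units, "Hey, what is your name dude?"))
```

With this rule the catch-all unit scores zero and only fires when nothing more specific matches, so the author of the units does not have to order them by hand.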

prev tag

Use of this tag is as follows:

<prev>text-input-expression</prev>

The <prev> element is part of the <unit> tag and declares a constraint on the matcher. In order for a sentence to match this unit, the <prev>“text-input-expression”</prev> must also match what the robot said previously. That is, the unit will match ONLY if what the robot said prior to the current input can match against “text-input-expression”.

The format for text-input-expression is identical to the format of data in the <input> tag, thus attention is directed to the input tag for details on syntax. The prev tag has an optional index attribute which specifies how many places back to go in a robot's conversation history to find a match. The default value is one. This means that the last sentence the robot said must match against the text-input-expression in order for the <unit> to match. If the index attribute is less than zero, e.g. <prev index=“−5”>*yes*</prev>, then each of the past five sentences of the robot history will be tested. If any of them matches, the unit will be allowed to match the <input> tag.

Consider the following FPML example of a conversation relating to going to a movie.

<unit>
  <input>yes</input>
  <prev index=“1”>* go to a movie *</prev>
  <response>which one?</response>
</unit>
<unit>
  <input>* matrix *</input>
  <prev>which one</prev>
  <response>The Matrix it is. When?</response>
</unit>
<unit>
  <input>*</input>
  <prev>* the matrix it is * when *</prev>
  <response>Sounds good</response>
</unit>

Example dialog:

robot>do you want to go to a movie?

user>yes

robot>which one?

user>I like the matrix

robot>The Matrix it is. When?

user>11:30

robot>Sounds good.

prev_input tag

Use of the prev_input tag is as follows:

<prev_input index=“1”>text-input-expression</prev_input>

The <prev_input> element is part of the <unit> tag and declares a constraint on the matcher. In order for a sentence to match this unit, the “text-input-expression” must also match what the user said previously. That is, the unit will match ONLY if what the user said prior to the current input can match against “text-input-expression”.

The format for text-input-expression is identical to the format of data in the <input> tag. Thus, the reader is referred to the discussion of the input tag for details on syntax.

The prev_input tag has an optional index attribute which specifies how many places back to go in the user's history to find a match. The default value is one, which means that the last sentence the user said must match against the text-input-expression in order for the <unit> to match.

If the index attribute is less than zero, e.g. <prev_input index=“−5”>*yes*</prev_input>, then each of the past five sentences of the user history will be tested. If any of them matches, the unit will be allowed to match the <input> tag.

This tag contains a text expression, just like the input expression, and is contained in <unit>.
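The index semantics shared by <prev> and <prev_input> can be sketched as a check against a conversation history. This is a minimal illustration: the function name history_allows is hypothetical, the history is assumed to be an oldest-first list of sentences, and a caller-supplied predicate stands in for text-input-expression matching.

```python
def history_allows(history, index, matches):
    """Decide whether a <prev>/<prev_input> constraint is satisfied.

    history: oldest-first list of earlier sentences (robot or user).
    index:   positive n tests the sentence n places back;
             negative n tests each of the last |n| sentences.
    matches: predicate standing in for the text-input-expression.
    """
    if index > 0:
        return len(history) >= index and matches(history[-index])
    # Negative index: any match within the window is enough.
    return any(matches(sentence) for sentence in history[index:])


robot_history = ["do you want to go to a movie?", "which one?"]
mentions_movie = lambda sentence: "go to a movie" in sentence

print(history_allows(robot_history, 1, mentions_movie))
print(history_allows(robot_history, -5, mentions_movie))
```

A positive index pins the constraint to one specific turn, while a negative index relaxes it to a recent window, which is useful when the exact turn in which something was said is not known.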

response tag

Use of the response tag is as follows:

<response> </response>

The response tag holds elements that will evaluate when the <input> (and <prev> or <prev_input>, if present) generate the best match for a given sentence. In some embodiments, the response tag defines what the robot will say or record. This tag is contained in <unit> and can contain text, as well as the following tags: <cond>, <rand>, <op>, <if>, <acq>, <rem>, <cap>, <hearsay>, <impasse>, <lc>, <uc>, <sentence>, <swap_pers>, <swap_pers1>, <rwild>, <wild>, <recurs>, and <quiet>.

if tag

Use of this tag is as follows:

<if name=“variableName” value=“text-input-expression”> fpml-tags </if>
<if expr=“script-expression”> </if>

The if tag is used to control execution flow. If the specified variable matches the value, the contained nodes are turned on. If not, the contained nodes are not executed. Variables and the <if> expression allow the FPML programs to run in a stateful way. This tag can contain the following tags: <unit>, <if>, and <situation>, and all tags the response tag can contain. This tag is contained in the following tags: <fpml>, <response>, all tags the response tag can contain, <if> and <situation>. The if tag can be used as an intra-unit tag to control program flow. As an example, consider the following:

<if name=“name” value=“* tommy *”>
  <unit>
    <input>* HI *</input>
    <response>It has been a long time. still working on the documentation</response>
  </unit>
  <unit>
    ...
  </unit>
</if>

In this situation, the units contained within the <if> statement will only be evaluated if the variable “name”, which holds the user name, is something with “tommy” in it. Although this is an elementary example, it shows how to use arbitrary variables to control program flow.

The value=“ . . . ” attribute of the <if> tag can be any expression that is valid in the <input> text. It can also be “?”. When value is ‘?’, the conditional evaluates to true if the variable is set and is false otherwise. This construct can be used in <if>, <cond>, and <situation>. Alternatively, the <if> tag can use “expr=” instead of name and value pairs. This allows code expressions to be used to perform the test. Additionally, the <if> tag can be used to control program flow in the <response> tag. As an example, consider the following:

<unit>
  <input>*</input>
  <response>
    Hello there.
    <if name=“vTalkative” value=“true”>
      Goodness, my. It is a lovely day. I wonder where the other people are. I love to chat.
    </if>
    How are you?
  </response>
</unit>

In this deliberately simple example, if vTalkative is set to “true”, then the text underneath the if statement will be added to the response string.
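The two kinds of test described for the value attribute (an expression match, or “?” for “is the variable set”) can be sketched as follows. This is an illustration under stated assumptions: the names if_holds and build_response are hypothetical, plain string equality stands in for full text-input-expression matching, and an unset variable is assumed to fail every test.

```python
def if_holds(variables, name, value, matches):
    """Evaluate <if name=... value=...> against the variable store."""
    if name not in variables:
        return False              # assumed: an unset variable never matches
    if value == "?":
        return True               # "?" only asks whether the variable is set
    return matches(value, str(variables[name]))


def build_response(variables):
    """Model of the response in the vTalkative example above."""
    same_text = lambda pattern, text: pattern == text
    parts = ["Hello there."]
    if if_holds(variables, "vTalkative", "true", same_text):
        parts.append("Goodness, my. It is a lovely day.")
    parts.append("How are you?")
    return " ".join(parts)


print(build_response({"vTalkative": "true"}))
print(build_response({}))
```

Because the condition is evaluated each time the response is built, changing the variable elsewhere in the program changes the behavior of this unit without editing it.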

situation tag

Use of this tag is as follows:

<situation name=“input-text-expression”> ... </situation>

The situation tag is another program control tag and is used to control which units get precedence over all other units. It is useful in managing discourse. However, it is not used in the <response> tag. This tag can contain the following tags: <unit> and <if>, and can be contained in: <fpml> and <if>.

As an example, consider the situation “computers” below:

<unit>
  <input>* Computers *</input>
  <response>Lets talk about computers. <quiet><rem name=“situation”>computers</rem></quiet></response>
</unit>
<situation name=“* computers *”>
  <unit>
    <input>[ANY(buy, purchase, lease, rent)]</input>
    <response>I've had success with Dell. Can go to dell online at www.dell.com</response>
  </unit>
  <unit>
    <input>[ANY(crash, crashed, crashing, bomb)]</input>
    <response>Which operating system are you running?</response>
  </unit>
  <unit>
    <input>* XP *</input>
    <prev_input>[ANY(crash, crashed, crashing, bomb)]</prev_input>
    <response>which program?</response>
  </unit>
  ...
  <unit>
    <input>*</input>
    <response>We were talking about computers. would you like to talk about something else?</response>
  </unit>
</situation>

The situation tag provides a way to encapsulate a particular subject and protect it somewhat from outside <unit>s. The protection is probabilistic: the catch-all <input>*</input> in the above situation fires only if no other <unit> in the global space produces a better match.

In the above example, <situation name=“* computers *”> is syntactically equivalent to this IF statement:

<if name=“situation” value=“* computers *”>

RESPONSE TAGS

As noted above, tags within the <response> generate output or record information. With a couple of exceptions, such as <cond>, every valid response tag can contain all other tags located within the response.

rand tag

Use of this tag is as follows:

<rand> <op> response-expression(1)</op> <op> response-expression(2)</op><op> response-expression(3)</op> </rand>

The rand tag picks one of its sub-elements at random and uses it to generate the response. This tag is contained in <response> and contains <op>. As an example of this tag's use, consider the following:

<unit>
  <input>HI +</input>
  <response>
    <rand>
      <op>Hello <acq name=“name”/>!!!</op>
      <op>Hidy ho!</op>
      <op>Cheers!</op>
    </rand>
    <rwild/>
  </response>
</unit>

cond tag

Use of this tag is as follows:

<cond>

The cond tag allows for conditional evaluation inside the <response> tag. It is a complicated form and has three levels of expressivity. The first level of expressivity is where it is identical to the <if> tag and can assume the same places and locations. For example,

<cond name=“variableName” value=“text-input-expression”>
<cond expr=“script-expression”>

The second level of expressivity is where the cond tag identifies the variable name, but not the variable value. In this case, the cond tag should contain only <op> tags. Each op tag will define the value field. The <op> which matches best is chosen for the evaluation. As an example, consider the following:

<unit>
  <input>SERVICECONNECTED</input>
  <response>
    <cond name=“bot_name”>
      <op value=“ScoobyDruid”>
        /nickserv identify oicu812
        <impasse>!MASTER \0304I took care of the privacy and the identity for you sir</impasse>
      </op>
      <op value=“MonkeyKnuckles”>
        /nickserv identify oicu812
        <impasse>!DELAY1</impasse>
        <impasse>!MASTER \0304I took care of the privacy and the identity for you sir.</impasse>
      </op>
      <op><impasse>!MASTER \0304This nick is not registered</impasse></op>
    </cond>
  </response>
</unit>

In the third level of expressivity, <cond> has no attributes, and each <op> field will have both a “name” and “value” attribute. As an example, consider the following:

<unit>
  <input>SERVICECONNECTED</input>
  <response>
    <cond>
      <op name=“bot_name” value=“ScoobyDruid”>
        /nickserv identify oicu812
        <impasse>!MASTER \0304I took care of the privacy and the identity for you sir</impasse>
      </op>
      <op name=“bot_name” value=“MonkeyKnuckles”>
        /nickserv identify oicu812
        <impasse>!DELAY1</impasse>
        <impasse>!MASTER \0304I took care of the privacy and the identity for you sir.</impasse>
      </op>
      <op><impasse>!MASTER \0304This nick is not registered</impasse></op>
    </cond>
  </response>
</unit>

Note that both forms have exactly the same behavior. There can also be default behavior for <cond> case expressions. Consider the example just below. If the variable “name” does not exist (via “?” construct), nothing is output. The default case is the last <op> tag without any expression. This will always evaluate, but only if nothing above it is fired.

<unit>
  <input>* HI *</input>
  <response>
    Hello
    <cond name=“name”>
      <op value=“?”></op>
      <op>, <acq name=“name”/></op>
    </cond>
    .
  </response>
</unit>
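The second and third forms of <cond>, together with the default <op>, can be sketched as a small selection routine. This is an illustration under stated assumptions: the name eval_cond and the dictionary representation of <op> elements are hypothetical, exact string equality stands in for text-input-expression matching, the first matching <op> is taken rather than a scored best match, and the “?” value is not modeled.

```python
def eval_cond(variables, ops, cond_name=None):
    """Return the response of the selected <op>, or "" if none applies.

    ops: list of dicts with an optional "name", an optional "value",
         and a "response"; an op with no "value" is the default case.
    cond_name: the name attribute of <cond>, used when an op has none.
    """
    for op in ops:
        if "value" not in op:
            return op["response"]          # default: nothing above fired
        name = op.get("name", cond_name)
        if variables.get(name) == op["value"]:
            return op["response"]
    return ""


# Second form: <cond name="bot_name"> with value-only ops.
second_form = [
    {"value": "ScoobyDruid", "response": "identify as ScoobyDruid"},
    {"value": "MonkeyKnuckles", "response": "identify as MonkeyKnuckles"},
    {"response": "This nick is not registered"},
]
# Third form: <cond> without attributes, each op names the variable.
third_form = [dict(op, name="bot_name") if "value" in op else op
              for op in second_form]

state = {"bot_name": "MonkeyKnuckles"}
print(eval_cond(state, second_form, cond_name="bot_name"))
print(eval_cond(state, third_form))
```

The third form costs more typing but lets each <op> test a different variable, which the second form cannot express.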

op tag

Use of the op tag is as follows:

<op>fpml-response</op>
<op value=“variableValue”>fpml-response</op>
<op name=“variableName” value=“variableValue”>fpml-response</op>

This tag is used to express a conditional or random “case” for output. See, e.g. <cond> and <rand> for usage. This tag contains text and any valid response tag, and is contained in <cond> and <rand>.

rem tag

Use of this tag is as follows:

<rem name=“varName” value=“varValue”/>
<rem expr=“script-expression”/>
<rem name=“varName”>The Variable Value</rem>

This tag is used to set a variable to a specified value. The names and values are arbitrary and can be any value. This tag can contain text and any tag which is valid within the <response> tag, and is contained in <fpml> (for variable initialization) and <response> (for setting new variables). As an example of this tag's usage, consider the following:

<fpml>
  <rem name=“bot_name” value=“Mr. Z”/>
  <rem name=“bot_favorite_color” value=“purple”/>
  ...

When the fpml loads, these variables are initialized to these values. Additionally, consider the following example:

<unit>
  <input>Let * talk about the +</input>
  <response>
    Sounds great.
    <quiet><rem name=“situation”><wild index=“2”/></rem></quiet>
    Do you have strong feelings about <wild index=“2”/>?
  </response>
</unit>

Within the unit tag, this sets the “situation” variable to the text of the second wildcard, and asks a general question.

acq tag

Use of this tag is as follows:

<acq name=“variableName”/>

This tag is used to retrieve a variable value and contains no other tags. This tag is contained in <response> or any valid response tag except <cond> and <rand>. As an example of this tag's use, consider the following:

<unit>
  <input>* HI *</input>
  <response>Well hello, <acq name=“name”/></response>
</unit>

quiet tag

Use of this tag is as follows:

<quiet> ... </quiet>

This tag is used to evaluate inner tags but to nullify the response these tags generate. This tag contains any valid tag within the <response>, and is contained in <response> and any valid tag within the <response>. As an example of this tag's use, consider the following:

<unit>
  <input>* computers *</input>
  <response>
    I am a computer.
    <quiet><rem name=“situation”>computers</rem></quiet>
  </response>
</unit>

Without the quiet tag, the text “computers” would be added to the response. With the quiet tag, it is not.

wild tag

Use of this tag is as follows:

<wild/> <wild index=“1 based wildcard index”/>

This tag is used to retrieve the value of the wildcards that are unified in the <input> expression. This tag contains no tags and is contained in <response> or any valid response tags. As an example of this tag's use, consider the following:

<unit>
  <input>* I like *</input>
  <response><recurs><wild index=“1”/></recurs>. I like <wild index=“2”/></response>
</unit>

rwild tag

Use of this tag is as follows:

<rwild/> <rwild index=“1 based wildcard index”/>

The engine supports recursion of responses. There are two recursion tags: <recurs> and <rwild>. These tags submit their evaluations back into the engine for response. This filtering mechanism allows language syntax to be reduced iteratively. The <rwild/> instruction is used to recurse on the first wildcard, and <rwild index=“2”/> is used to recurse on the second wildcard. This tag contains no other tags and is contained in <response> or any response sub element. As an example of this tag's use, consider the following:

<unit> <input>THE *</input> <response><rwild/></response> </unit>

This generic pattern will be matched only if no other better match is found. In this case, the determiner is stripped off and the text is resubmitted for evaluation with the hope that the engine will better recognize the entity without the determiner.

recurs tag

Use of this tag is as follows:

<recurs>fpml-response</recurs>

As noted above, the engine supports recursion of responses and this is the other of two recursion tags. These tags submit their evaluations back into the engine for response. This filtering mechanism allows syntax to be reduced iteratively. The <recurs> instruction, unlike <rwild>, can contain elements. These elements are evaluated and the resulting text is resubmitted as input to the fpml interpreter. As an example of this tag's use, consider the following:

<unit>
  <input>DO YOU KNOW WHO * IS</input>
  <response><recurs>WHO IS <wild/></recurs></response>
</unit>

This example takes a more complex grammatical form and reduces it to a more generic form. Consider the synonym rewrite as follows:

<unit>
  <input>HI THERE</input>
  <response><recurs>HELLO1</recurs></response>
</unit>
<unit>
  <input>Aloha</input>
  <response><recurs>HELLO1</recurs></response>
</unit>
<unit>
  <input>HIYA</input>
  <response><recurs>HELLO1</recurs></response>
</unit>
...

This example allows for a complicated hello response, without having to duplicate the response expression across a variety of units.
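The resubmission performed by <recurs> in the synonym rewrite above can be sketched as a loop over a table of units. This is a minimal illustration: the function name respond, the tuple encoding of responses, the text of the HELLO1 response, and the max_depth guard against endless resubmission are all assumptions, and units are looked up by exact, case-insensitive text rather than by pattern matching.

```python
def respond(units, sentence, max_depth=10):
    """Evaluate a sentence, following <recurs> resubmissions."""
    text = sentence
    for _ in range(max_depth):
        kind, payload = units.get(text.upper(), ("say", ""))
        if kind == "say":
            return payload           # plain response text ends the chain
        text = payload               # <recurs>: resubmit the text as input
    raise RuntimeError("recursion limit reached")


units = {
    "HI THERE": ("recurs", "HELLO1"),
    "ALOHA": ("recurs", "HELLO1"),
    "HIYA": ("recurs", "HELLO1"),
    # Hypothetical shared response; the source does not show this unit.
    "HELLO1": ("say", "Hello, nice to see you."),
}

print(respond(units, "Aloha"))
```

Only the HELLO1 unit needs to change when the greeting is revised, which is the duplication saving described above.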

impasse tag

Use of the impasse tag is as follows:

<impasse>

This tag in the <response> element forces a callout to the calling application with the evaluation text of its inner elements. This is used to communicate information to the outer application. This tag is contained in <response> or any response sub element, and contains text or any response sub element. A command structure can be utilized that uses the impasse tag to trigger application specific operations.

cap tag

This tag capitalizes the first letter of the output of all its contained elements or text and is contained in any response tag, and contains any response tag/text. As an example of its use, consider the following:

<cap>united states</cap> output: United States

<lc><uc> tags

These tags make the output of the contained elements all lower case <lc> or upper case <uc>.

<sentence> tag

This <response> tag will convert the contained text and elements into a sentence form.

<swap_pers> tag

This tag transforms inner elements from first person into second person.

<swap_pers1> tag

This tag transforms inner elements and text from second person to third person.

Script Expressions

In accordance with one embodiment, the FPML runtime (discussed in more detail below) can support assignment and conditional testing expressions. The syntax is that of ECMAScript, but it does not include control statements or functions.

Script expressions are added to fpml through the expr=“script expression” attribute. This attribute is valid in the following tags: <if>, <cond>, <op>, <rem> and <acq>. As an example of this attribute's use, consider the following:

<if expr=“(var1 == 1 && var2 == 2.0)”>
<if expr=“myVar == myVar1 + 1 && profile[key_name] == ‘george’”/>
<rem expr=“key_name = 0; profile_array[key_name] = ‘Mark’;”/>

More than one assignment expression can be placed in a single “expr” attribute by separating the expression statements with a semicolon ‘;’. This is useful for creating <rem> expressions which initialize many variables at once. If the rem expression is in the top level of the file, it will be evaluated when the FPML is instantiated in, for example, a bot. As an example, consider the following:

<fpml>
  <rem expr=“likes_cooking = 0; likes_eating = 1; likes_gasgrill = 2; profile[likes_cooking] = 0.0; profile[likes_eating] = 0.5; profile[likes_gasgrill] = 1.0;”/>
</fpml>

In this case, the <rem> expression will be evaluated on bot startup, and all those variables initialized to these values.

Variables are loosely typed and can transform to new types without explicit operators. New variables can be created on the fly and are case sensitive. For example, “Var1” is not the same as “var1”. Numbers are created simply by assigning a numeric value to a variable, e.g. Var1=1, Var1=1.23445 and the like. Strings are created by using the ‘ ’ single quote, e.g. Var1=‘Mark’. Arrays are created simply by indexing a variable. If the variable exists, it will be retyped as an array; if the index is greater than the size of the array (initially 0 length), the array will grow dynamically, e.g.

profile[0]=0.95; profile[1]=0.50; profile[2]=0.25;
likes_food=0; likes_beach=1; likes_coffee=2;
profile[likes_food]=0.95; profile[likes_beach]=0.50; profile[likes_coffee]=0.25;
profile[likes_coffee+1]=‘mark’

Array elements are full-fledged variables and are loosely typed. Index[0] can be a string, while Index[1] can be floating point, Index[2] can be another array, and the like. The expression system does not impose a limit on the dimension of the array, e.g. Array[0][1][0][0]=‘true’ is valid.
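By way of illustration only, the array behavior described above might be modeled as in the following Python sketch. The helper name `assign_index`, the use of a dictionary to hold variables, and the choice to fill skipped positions with `None` are assumptions made for this example; the description does not specify how the FPML runtime stores variables or fills gaps.

```python
def assign_index(variables, name, index, value):
    """Assign variables[name][index] = value, retyping and growing as needed."""
    current = variables.get(name)
    if not isinstance(current, list):
        current = []                 # an existing non-array value is retyped as an empty array
        variables[name] = current
    while len(current) <= index:
        current.append(None)         # grow dynamically until the index is valid (assumed filler)
    current[index] = value           # elements are loosely typed: any value is accepted


variables = {"profile": 3.14}        # "profile" starts out as a number
assign_index(variables, "profile", 0, 0.95)
assign_index(variables, "profile", 2, 0.25)
assign_index(variables, "profile", 3, "mark")
print(variables["profile"])
```

A single array can thus hold a mix of numbers and strings, as in the profile example above.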

In one embodiment, the following operators and grouping constructs are supported by the expression evaluator. The precedence rules of ECMAScript have been adopted.

=   assignment operator
==   comparison operator
!=   comparison operator
( )   grouping operator
>   greater than
<   less than
>=   greater than or equal to
<=   less than or equal to
&&   logical AND
||   logical OR
[ ]   array index
‘xx’   constant string, e.g. v = ‘mark’
{..,..}   constant array, e.g. v = {0, 1, 2, 3, 4, 5};
!   logical NOT
+   add operator
-   subtract operator
*   multiply operator
/   divide operator

The following keywords are currently defined for the language: true, false.

Probabilistic Expression Matching

Having now considered the above discussion of the functional presence engine and the knowledge bases, consider the following. As can surely be appreciated, FPML, at a basic level, can be used to define a list of regular expressions which trigger a response when incoming data is matched against the expression. It is desirable that the matching process be as smart as possible insofar as its ability to handle collisions. A collision occurs when incoming text matches two or more FPML units. To address the issue of collisions and in accordance with one embodiment, a statistical or probabilistic methodology is utilized. For example, in accordance with this embodiment, instead of returning Boolean values, the process can return a probabilistic score that identifies how closely the input text matches the particular knowledge base unit. If the scoring methodology is sound, then unit interdependence is much less of an issue and the highest ranking FPML unit which matches the incoming text is also guaranteed to be the most semantically relevant to the text and thus captures the most information of all competing knowledge base units.

As noted above, more than one <unit> may unify successfully against incoming text. This is expected and in some instances desirable. The FPML Runtime (discussed in more detail below) uses a probabilistic methodology to choose the best unification among competing units. The best unification, in accordance with one embodiment, is the unification that provides the best semantic coverage for the incoming text. This is achieved, in this embodiment, by scoring exact graphic word matches at a high value and scoring wildcard matches lower. As an example, consider the following FPML:

<!-- input 1 -->
<input>OSAMA IS EVIL</input>
...
<!-- input 2 -->
<input>* OSAMA *</input>
<!-- input 3 -->
<input>OSAMA IS *</input>

Given the string “OSAMA IS EVIL”, more semantic information is uncovered by selecting input 1 as the best unification. Given the string “OSAMA IS MOVING”, more semantic information is uncovered by selecting input 3, since it matches two graphic words exactly where input 2 matches only one. Semantic context is garnered when graphic words in the <input> match graphic words in the incoming text. The more graphic words which match exactly, the more semantic information is uncovered. Thus, a generalization is that for any incoming text, one wants to match it to an input which uncovers the most graphic words either directly or through a functional process.

Consider the following mathematical approach. Each <input> expression E is represented by terms (e₁ . . . e_(k)), where each term can be either a graphic word, a wildcard type, or an embedded functional expression, and k is the total number of terms. Given this, consider the following table which separates four expressions into their component terms:

OSAMA is evil: e1 = osama, e2 = is, e3 = evil

OSAMA is a *: e1 = osama, e2 = is, e3 = a, e4 = KSTAR

* OSAMA *: e1 = KSTAR, e2 = osama, e3 = KSTAR

[ANY(OSAMA, USAMA) from +]: e1 = STAR + Fn(ANY(OSAMA, USAMA))

Each incoming sentence S is split into words (w₁ . . . w_(n)), where each word represents a graphic word as defined by the lexical analyzer and n is the total number of words in the sentence. Thus, “Osama is a evil man” breaks down as follows:

w₁=Osama

w₂=is

w₃=a

w₄=evil

w₅=man

The unifier takes an <E,S> pair and attempts to create a resultant list R of size k where r_(i) is a list of words subsumed by e_(i). If such a list R can be produced, S can be said to unify against E. The incoming text “Osama is moving out” unifies against three of the four specified inputs as follows:

OSAMA * moving *: r₁ = osama, r₂ = is, r₃ = moving, r₄ = out

* OSAMA *: r₁ = Ø, r₂ = osama, r₃ = is moving out

[ANY(OSAMA, USAMA) from +]: the single term STAR + Fn(ANY(OSAMA)) gives r₁ = Osama is moving out

In this example, the FPML engine needs to make a decision about which is the best unification. It is easy to observe that “OSAMA * moving *” would be the <input> which uncovers the greatest semantic context. Thus, this is the preferred unification. Using R, a probability is calculated by assigning a score to each r_(i), summing the scores, and dividing by the number of words in the input (n) plus the number of KLEENSTAR matches which unify against nothing.
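The unification step described above can be illustrated, under stated assumptions, with the following Python sketch. The function name `unify`, the tuple encoding of terms (`"word"`, `"one"`, `"star"`, `"kstar"`), the case-insensitive word comparison, and the shortest-capture-first backtracking order are all assumptions made for this example; functional terms are omitted, and the actual unifier may explore candidates differently.

```python
def unify(terms, words):
    """Return the list R (one list of captured words per term), or None if no unification exists."""
    if not terms:
        return [] if not words else None     # succeed only when every word is consumed
    kind, *arg = terms[0]
    rest = terms[1:]
    if kind in ("word", "one"):
        # A graphic word must match exactly; MATCHONE accepts any single word.
        if words and (kind == "one" or words[0].lower() == arg[0].lower()):
            tail = unify(rest, words[1:])
            if tail is not None:
                return [[words[0]]] + tail
        return None
    # Wildcards: STAR captures one or more words, KLEENSTAR zero or more.
    minimum = 1 if kind == "star" else 0
    for n in range(minimum, len(words) + 1):
        tail = unify(rest, words[n:])
        if tail is not None:
            return [words[:n]] + tail
    return None


words = "Osama is moving out".split()
print(unify([("word", "osama"), ("kstar",), ("word", "moving"), ("kstar",)], words))
print(unify([("kstar",), ("word", "osama"), ("kstar",)], words))
print(unify([("word", "osama"), ("word", "is"), ("word", "evil")], words))
```

An empty list in the result corresponds to a wildcard that unified against nothing (Ø in the table above), and a result of `None` indicates that the sentence does not unify against the expression.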

In accordance with one embodiment, there are two methods that can be utilized for assigning scores. The first method is an ad-hoc method that works well in the absence of word statistics. As an example, consider the following:

Score_(i) =

If E_(i) is a graphic word type, the score for r_(i) is .95.

If E_(i) is a MATCHONE wildcard type, the score is .7.

If E_(i) is a STAR (one or more), the score is .55 times the number of words in r_(i).

If E_(i) is a KLEENSTAR, the score is .45 times the number of words in r_(i).

If E_(i) is a functional type, the score is dependent on the function. This value is usually calculated by adding high scores for terms that match the function, and low scores for extra terms.

The second method can utilize term weights, such as inverse document frequency. Here, the graphic words can be assigned scores based on the semantic information returned by the word and do not need to be a constant. As an example, consider the following.

Count_(i) =

If E_(i) is a graphic word or MATCHONE, 1.

If E_(i) is STAR, the number of terms captured by the wildcard.

If E_(i) is KLEENSTAR and the number of terms is greater than 0, the number of terms captured by the wildcard.

If E_(i) is KLEENSTAR and the number of terms is 0, 1.

The score is thus computed as follows:

$\mathrm{Prob}(E \mid S) = \frac{\sum_{i=1}^{k} \mathrm{Score}(r_{i})}{\sum_{i=1}^{k} \mathrm{Count}(r_{i})}$

$\mathrm{Prob}(E \mid S) = \sum_{i=1}^{k} \mathrm{Score}(r_{i})$

The first equation is the normalized approach. In this methodology, scores from different inputs can be compared to each other in a meaningful way.

In many applications, relative comparisons among different inputs are not necessary, and there are some consequences of the normalization method related to matching. Hence, the second equation constitutes an unnormalized variant that removes these side effects.

Using the above scoring equation, the scores are calculated as follows:

OSAMA * moving *, against “Osama is moving out”: (.95 + .45 + .95 + .45)/(1 + 1 + 1 + 1) = .7

* OSAMA *, against “Osama is moving out”: (0 + .95 + .45 + .45 + .45)/(1 + 1 + 3) = .46

[ANY(OSAMA, USAMA) from +], against “Osama is moving out”: (.95 + .55 + .55 + .55)/(4) = .65

It is also reasonable to remove the normalization step from the equation. In this case, generated scores will be significantly higher, and reflect precisely the amount of data that has been unified, regardless of the size of the source string. The advantage is that matches will generate larger numbers. The disadvantage is that scores generated by an input/pattern pair can only be reasonably compared with the results of unification from other patterns using the same input. Comparing results from unifications of different inputs is not possible when normalization is turned off.
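The ad-hoc scoring method and both variants of the equation can be sketched in Python as follows. This is a minimal illustration rather than the engine's implementation: the names `SCORES`, `score_term`, `count_term` and `probability` and the term-kind labels are assumptions made for this example, and functional terms (whose score depends on the function) are not modeled.

```python
# Ad-hoc per-term scores taken from the description above.
SCORES = {"word": 0.95, "one": 0.70, "star": 0.55, "kstar": 0.45}


def score_term(kind, captured):
    """Score(r_i): constant for words and MATCHONE, per-word for wildcards."""
    if kind in ("word", "one"):
        return SCORES[kind]
    return SCORES[kind] * len(captured)


def count_term(kind, captured):
    """Count(r_i): an empty KLEENSTAR still counts as 1."""
    if kind in ("word", "one"):
        return 1
    return max(len(captured), 1)


def probability(kinds, result, normalize=True):
    """Prob(E|S) from the term kinds of E and the unification result R."""
    total = sum(score_term(k, r) for k, r in zip(kinds, result))
    if not normalize:
        return total
    return total / sum(count_term(k, r) for k, r in zip(kinds, result))


# "OSAMA * moving *" against "Osama is moving out"
kinds_a = ["word", "kstar", "word", "kstar"]
result_a = [["Osama"], ["is"], ["moving"], ["out"]]
# "* OSAMA *" against the same text
kinds_b = ["kstar", "word", "kstar"]
result_b = [[], ["Osama"], ["is", "moving", "out"]]

print(round(probability(kinds_a, result_a), 2), round(probability(kinds_b, result_b), 2))
print(round(probability(kinds_a, result_a, normalize=False), 2))
```

With normalization the two patterns score approximately .7 and .46, matching the first two calculations above; with `normalize=False` the first pattern yields the raw sum of 2.8.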

Exemplary Software Architecture

The following discussion describes an exemplary software architecture that can implement the systems and methods described above. It is to be appreciated that the following discussion provides but one example and is not intended to limit application of the claimed subject matter. Accordingly, other architectures can be utilized without departing from the spirit and scope of the claimed subject matter.

FIG. 3 shows an exemplary system generally at 300 comprising one or more runtime objects 302, 304, 306 and 308 and one or more knowledge base objects 350, 352 and 354.

In the illustrated and described embodiment, the runtime objects are software objects that have an interface that receives data in the form of text and produces text or actions. In one embodiment, the runtime objects are implemented as C++ objects. Knowledge base objects 350-354 are software objects that load and execute FPML knowledge bases and handle user requests. Together, the runtime objects and the knowledge base objects cooperate to implement the functionality described above.

More specifically, in the illustrated and described embodiment, each knowledge base object is associated with a single FPML file. Examples of FPML files are described above. The knowledge base object is configured to load and execute, in a programmatic manner, the FPML file. In some embodiments, FPML files can be nested and can contain links to other objects. This allows one broader knowledge base to include individual FPML files. This keeps the knowledge organized and makes it easier to edit domain specific knowledge. In the present example, runtime objects can point to or otherwise refer to more than one knowledge base object, thus utilizing the functionality of more than one knowledge base object. Similarly, knowledge base objects can be shared by more than one runtime object. This promotes economies, scalability and use in environments in which it is desirable to receive and process text from many different sources.

In the illustrated and described embodiment, the runtime objects contain state information associated with the text that they receive. For example, if the text is received in the form of a conversation in a chat room, the runtime object maintains state information associated with the dialog and discourse. The runtime objects can also maintain state information associated with the FPML that is utilized by the various knowledge base objects. This promotes sharing of the knowledge base objects among the different runtime objects.

As an overview of the processing that takes place using system 300, consider the following. In the present example, the runtime objects receive text and then utilize the knowledge base objects to process the text using the FPML file associated with the particular knowledge base object. Each of the runtime objects can be employed in a different context, while utilizing the same knowledge base objects as other runtime objects.

Now specifically consider knowledge base object 354 which is associated with a loaded FPML file N. As described above, the FPML file comprises a hierarchical tree structure that has <unit> nodes that encapsulate <input> nodes and <response> nodes. Each of these nodes (and others) is described above. When a runtime object receives text, it passes the text to one or more of the knowledge base objects. In this particular example, runtime objects 304 and 308 point to knowledge base object 354. Thus, each of these runtime objects passes its text to knowledge base object 354. As noted above, each knowledge base object, through its associated FPML file, can be associated with a particular lexer that performs the lexical analysis described above. When the knowledge base object receives text from the runtime object(s), it lexically analyzes the text using its associated lexer.

As noted above, because each runtime object can point to more than one knowledge base object, and because each knowledge base object can specify a different lexer, the same text that is received by a runtime object can be processed differently by different knowledge base objects.

Consider now the process flow when text is received by a runtime object. When the runtime object receives its text, it makes a method call on one or more of the knowledge base objects and provides the text, through the method call, to the knowledge base object or objects. The process now loops through each of the knowledge base objects (if there is more than one), looking for a match. If there is a match, the method returns a score and an associated node that generated the score, to the runtime object. In the present example, assume that the FPML associated with knowledge base object 354 processes the text provided to it by runtime object 308 and, as a result, generates a match and score for the <input2> node of <unit 2>. The score and an indication of the matching node are thus returned to the runtime object and can be maintained as part of the state that the runtime object maintains. In the event that there are multiple matches, a best match can be calculated as described above. Once the runtime object has completed the process of looping through the knowledge base objects, and, in this example, ascertained a best match, it can then call a method on the matching node to implement the associated response. Note that in the presently-described embodiment, for each <input> node there is an associated <response> node that defines a response for the associated input. Exemplary responses are described above. Thus, when the runtime object calls the knowledge base object and receives a particular response, the runtime object can then call an associated application and forward the response to the application.
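The process flow just described can be sketched, purely for illustration, in Python (the description above mentions C++ objects in one embodiment). The class names `Unit`, `KnowledgeBase` and `Runtime`, the method names, and the use of plain scoring callables in place of FPML unification are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Unit:
    name: str
    scorer: Callable[[str], float]   # stands in for unification; 0.0 means "no match"
    response: str


@dataclass
class KnowledgeBase:
    units: List[Unit]

    def match(self, text: str) -> Optional[Tuple[float, Unit]]:
        """Return (score, unit) for the best-scoring unit, or None if nothing matches."""
        hits = [(u.scorer(text), u) for u in self.units]
        hits = [h for h in hits if h[0] > 0.0]
        return max(hits, key=lambda h: h[0]) if hits else None


@dataclass
class Runtime:
    knowledge_bases: List[KnowledgeBase]
    state: Dict[str, object] = field(default_factory=dict)

    def process(self, text: str) -> Optional[str]:
        best = None
        for kb in self.knowledge_bases:          # loop through each knowledge base object
            hit = kb.match(text)
            if hit is not None and (best is None or hit[0] > best[0]):
                best = hit
        if best is None:
            return None
        score, unit = best
        self.state["last_match"] = (unit.name, score)   # state is kept per runtime object
        return unit.response


shared = KnowledgeBase([
    Unit("exact", lambda t: 0.95 if t == "OSAMA IS EVIL" else 0.0, "exact response"),
    Unit("wildcard", lambda t: 0.46 if "OSAMA" in t else 0.0, "wildcard response"),
])
runtime_a = Runtime([shared])
runtime_b = Runtime([shared])    # the same knowledge base object serves both runtimes
print(runtime_a.process("OSAMA IS EVIL"))
print(runtime_b.process("OSAMA IS MOVING"))
```

Because each runtime keeps its own state while the knowledge base holds none, one knowledge base object can be shared by several runtime objects, as the description above contemplates.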

Exemplary System Utilizing Runtime and Knowledge Base Objects

FIG. 4 shows an arrangement of components that utilize the above-described runtime and knowledge base objects in accordance with one embodiment, generally at 400. In this system, one or more human users or monitors can interact with an application 402 which, in turn, interacts with a functional presence system 404. In accordance with one embodiment, the application can comprise an agent manager component 403, which is discussed in greater detail below in the section entitled “Implementation Example Using Dynamic Agent Server”.

In accordance with the presently described embodiment, functional presence system 404 comprises a functional presence engine 406 which itself comprises one or more runtime objects 408 and one or more knowledge base objects 410. Each knowledge base object can have an associated lex object 412 that is configured to perform lexical analysis as described above. The functionality of the runtime and knowledge base objects is discussed above and, for the sake of brevity, is not repeated here.

In addition, system 404 can comprise an information retrieval component 414 which is described in more detail just below.

Information Retrieval

In accordance with one embodiment, information retrieval component 414 utilizes the services of the functional presence engine 406 to process large numbers of documents and perform searches on the documents in a highly efficient manner.

Before, however, describing the information retrieval process, a little background is given so that the reader can appreciate the inventive processes. One way to perform searches on documents is to perform a so-called linear or serial search. For example, assume that, given 4 Gigabytes of data, an individual wishes to search for a particular term that might be contained within the data. By performing a linear or serial search, a search would proceed linearly, byte by byte, until the term was or was not found. Needless to say, a linear search can take a long time and can be needlessly inefficient.

In accordance with the described embodiment, the information retrieval component creates and utilizes a table whose individual entries point to individual documents. Entries in the table are formed utilizing the services of the functional presence engine.

As an example, consider FIG. 5 which shows a system generally at 500 that comprises a functional presence engine that utilizes one or more runtime objects and one or more knowledge base objects 506. An information retrieval component 508 utilizes functional presence engine 502 to create and use a table 510 which is shown in expanded form just beneath system 500.

In accordance with one embodiment, functional presence engine 502 receives and processes data which, in this example, comprises a large number of documents. The documents, under the influence of the functional presence engine and its constituent components, undergo lexical analysis and tokenization (typing) as described above. As these processes were described above in detail, they are not described again. The output of the tokenization or typing process is one or more tables.

Specifically, in the present example, table 510, shown in expanded form, includes a column 512 that holds a value associated with a particular word found in the documents, a column 514 that holds a value associated with the type assigned to the word in the tokenization process, and a column 516 that holds one or more values associated with the documents in which the word (type) appears. Thus, in this example, each row defines a word, an associated type and the document(s) in which the word or type appears. So, for example, word A is assigned type 1 and appears in documents 1 and 3.

In the illustrated and described embodiment, the typing of the data removes any need to do a key word search. Instead, one can search for various types and can specify, through the information retrieval component 508, a regular expression to be used to search the various types. For example, one might specify a search for all documents that contain an email address that contains a certain specific term. A search on the type “Email addresses” identifies all of the email addresses from column 514. A regular expression search of column 512 can then identify all of the matches whose associated documents are indicated in column 516.

Although this is a simple example, as the skilled artisan will surely appreciate, what begins to emerge is a system that allows for structured types of operations to be performed on unstructured data.
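The typed-table search described above can be illustrated with the following Python sketch. The row data, the type labels `EMAIL` and `WORD`, and the function name `search_by_type` are invented for this example; the three fields of each row merely mirror columns 512 (word), 514 (type) and 516 (documents).

```python
import re

# Each row: (word, assigned type, set of document identifiers) -- hypothetical data.
TABLE = [
    ("alice@example.com", "EMAIL", {1, 3}),
    ("bob@sample.org", "EMAIL", {2}),
    ("example", "WORD", {1, 2}),
]


def search_by_type(table, token_type, pattern):
    """Return the documents containing a word of the given type that matches the regex."""
    regex = re.compile(pattern)
    documents = set()
    for word, assigned_type, doc_ids in table:
        # Filter on the type column first, then apply the regular expression to the word.
        if assigned_type == token_type and regex.search(word):
            documents |= doc_ids
    return documents


print(search_by_type(TABLE, "EMAIL", r"example"))
```

Here the plain word “example” is ignored because its type is not an email address, so only the documents associated with the matching email address are returned.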

In the illustrated and described embodiment, the information retrieval process is passive in that it is provided with information and then processes the information, along with the functional presence engine, to provide a robust searching and retrieval tool. In this particular example, the information retrieval component is not responsible for seeking out and acquiring the information that it searches.

Exemplary Method

FIG. 5a illustrates steps in a method in accordance with one embodiment. The method can be implemented in connection with any suitable hardware, software, firmware or combination thereof. In one embodiment, the method can be implemented in connection with systems such as those shown and described in FIGS. 1-5. Step 530 receives text from a data origination entity. A data origination entity, as used in this document, is intended to describe an entity from which data is obtained. For example, in the context of the Internet, a data origination entity can comprise a server, a server-accessible data store, a web page and the like. In the context of a company intranet, a data origination entity can comprise a network-accessible data store, a server, a desktop computer and the like.

Step 532 probabilistically parses the text effective to tokenize text portions with tokens. In the illustrated and described embodiment, probabilistic parsing is accomplished using one or more matching rules that are defined as regular expressions in an attempt to match to text received from the data origination entity. Examples of how probabilistic parsing can take place are described above. Hence, for the sake of brevity, such examples are not repeated here. Step 534 conducts a search on the tokens. Examples of how and why searches can be conducted are given above and, for the sake of brevity, are not repeated here.

Implementation Example Using Dynamic Agent Server

In accordance with one embodiment, the above-described systems and methods can be employed in the context of a system referred to as the “dynamic agent server.” The dynamic agent server is a client-server platform and application interface for managing and deploying software agents across networks and over the Internet. The dynamic agent server is enabled by the functional presence engine and, in particular, the runtime objects that are created by the functional presence engine. The dynamic agent server can be configured to incorporate and use various applications and protocols, ingest multiple textual data types, and write to files or databases.

As an example, consider FIG. 6 which shows a system comprising a dynamic agent server 600 that comprises or uses a functional presence engine 602 which, in turn, utilizes one or more runtime objects 604 and one or more knowledge base objects 606. An application 608 is provided and includes an agent manager component 610 which manages agents that get created and deployed. One or more data sources 612 are provided and include, in this example, IRC data sources, TCP/IP data sources, POP3 data sources, wget data sources, and htdig data sources, among others. The data sources can be considered as a pipeline through which data passes. In the present example, data can originate or come from the Internet 614, from a network 616 and/or various other data stores 618. Data sources 612 are the pipeline through which the data travels.

In accordance with one embodiment, an agent can be considered as an instance of a runtime object coupled with a data source. In the illustrated and described embodiment, application 608 controls the types of agents that are generated and deployed. In the present example, there are two different types of agents that can be created. A first type of agent gets created and opens a communication channel via a data source and simply listens to a destination such as one of the data origination entities named above, i.e. Internet 614, network 616 and/or data store 618. This type of agent might be considered a passive agent. A second type of agent gets created and opens a communication channel via a data source and interacts in some way with the destination. This second type of agent communicates back through application 608 to the functional presence engine 602. This type of agent might be considered an active agent.

In the illustrated and described embodiment, an agent (i.e. a runtime object 604 and a data source) is associated with one or more knowledge base objects 606. The knowledge base objects define how the agent interacts with data from a data origination entity. That is, the functional presence engine 602 is utilized to direct and control agents. In the illustrated and described embodiment, there is a one-to-one association between a particular runtime object and data source defined by the associated agent.
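The pairing of one runtime object with one data source, and the distinction between passive and active agents, might be sketched in Python as follows. The class name `Agent`, its attributes, and the modeling of a data source as an iterable and of a runtime object as a callable are assumptions made for illustration only; real data sources such as IRC or POP3 connections are not modeled.

```python
class Agent:
    """One runtime callable paired one-to-one with one data source (hypothetical model)."""

    def __init__(self, data_source, runtime, reply_channel=None):
        self.data_source = data_source      # iterable that yields incoming text
        self.runtime = runtime              # callable: text -> response, or None for no match
        self.reply_channel = reply_channel  # callable used by an active agent; None if passive
        self.observed = []                  # responses generated while listening

    @property
    def is_active(self):
        return self.reply_channel is not None

    def run(self):
        for text in self.data_source:
            response = self.runtime(text)
            if response is None:
                continue                    # nothing of interest in this text
            self.observed.append(response)
            if self.is_active:
                self.reply_channel(response)   # an active agent interacts with the destination


def detect_alert(text):
    # Stand-in for a runtime object backed by knowledge base objects.
    return "ALERT" if "alert" in text else None


sent = []
passive_agent = Agent(["hello", "alert now"], detect_alert)
active_agent = Agent(["alert now"], detect_alert, reply_channel=sent.append)
passive_agent.run()
active_agent.run()
print(passive_agent.observed, sent)
```

In this sketch a passive agent only records what it detects, whereas an active agent also sends its response back over the channel it was given.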

Because of the componentized nature of the runtime objects, large numbers of agents can be created and deployed across various different types of systems. Additionally, as the runtime objects can be associated with more than one knowledge base, a very robust information processing system can be provided.

As an example of how the dynamic agent server can be utilized, consider the following. The wget data source is a mechanism which, in combination with a runtime object, goes to a particular web site and can download the entire web site. That is, the agent in this example establishes a connection with the web site, follows all of the links for the web site, and downloads all of the content of that site. This, in and of itself, can provide a huge data problem insofar as moving from a hard target (the web site) to a very large collection of unstructured data (the entire content of the site). The functional presence engine can alleviate this problem by allowing the agent, in this instance, to go to the website and only pull information that is important by identifying which pages are relevant as defined by the FPML.

Agent Based Information Retrieval Response System

In accordance with another embodiment, the above-described systems and methods can be employed to deploy multiple agents across a network to gather, read, and react to various stores of unstructured data. The system can utilize an analysis tool that tags, indexes and/or otherwise flags relationships in structured and unstructured data for performing alerting, automation and reporting functions either on a desktop system or enterprise wide.

In accordance with one embodiment, the system utilizes a two stage process. In the first stage, the system retrieves information of interest and stores the information in a location that is associated with a particular agent. In the second stage, the system presents a user with an interface by which the user can query the index to find documents of interest.

As will surely be appreciated by the skilled artisan, the systems and methods described above provide a tool that can be utilized to impart to generally unstructured data a structure that permits a robust and seemingly endless number of operations to be employed on the now-structured data. The various approaches discussed above are generally much simpler and more flexible than other data disambiguation approaches that have been utilized in the past. The examples provided below describe scenarios in which the technology described above can be employed. These examples are not intended to limit application of the claimed subject matter to the specific examples described below. Rather, the examples are intended to illustrate the flexibility that the tools described above provide.

EXAMPLE 1

One important area of application pertains to real time scenarios in which detection of patterns and appropriate response generation takes place. As an example, consider a scenario in which law enforcement individuals wish to search for potential child molesters in chat rooms. Given the vast expanse of cyberspace and the seemingly endless number of chat rooms that serve as forums for child molesters, the task of monitoring these chat rooms and reacting to dialogs from potential molesters is a daunting one. One current approach is to assign a law enforcement individual a small number of chat rooms and have them monitor the chat rooms for suspicious dialogs. When a suspicious dialog is detected, the law enforcement individual can intervene and attempt to ensnare the potential molester. There are limits on this approach, the most obvious of which is that only a small number of chat rooms can be monitored by any one law enforcement individual. Given the budgetary constraints of many law enforcement organizations, funds are often not available to place as many law enforcement individuals on this task as are necessary or desirable.

Using the above-described systems and methods, agents can be deployed to essentially sit in multiple chat rooms and use knowledge bases to monitor and process the dialog that takes place in the chat room. If and when problematic words or patterns are identified, the agent can react immediately. In one instance, the agent might notify a human monitor, via an application such as application 608 (FIG. 6), that a pattern has been detected, thus allowing the human monitor to join in the conversation in the chat room and participate in further law enforcement activities. In another instance, the agent might be programmed to engage a potential molester in a conversation and, in parallel, generate an alert for a human monitor.

In this particular instance, the inventive systems and methods are force multipliers in that the ratio of chat rooms to human monitors can be dramatically increased.

EXAMPLE 2

The systems and methods described above can be utilized to develop links and relationships within generally unstructured data. In one example, links are built through proximities, where proximities can be subjects that appear in or at the same media, place, time, and the like. As an example, consider that a subject of interest is “John Doe” and that John Doe is suspected of having a relationship with a certain person of interest “Mike Smith”. Yet to date, this relationship has been elusive. Consider now that a system, such as the system of FIG. 6, is set up with agents to monitor various data origination entities for information associated with John Doe and, independently, Mike Smith. Assume that during the monitoring of the data origination entities, information is developed that indicates that John Doe went to the Pentagon at 9 P.M. Assume also that in monitoring the data origination entities, information is developed that associates a time range from between 6 P.M. and 11 P.M. with Mike Smith's presence in the Washington D.C. area. Once this information has been developed and processed by the inventive systems described above, as by, for example, being formulated into a table such as the table shown in FIG. 5, a search can be conducted on the table to establish a link or relationship between John Doe and Mike Smith. As noted above, the search can be defined as a simple key word search of the table, or a more robust regular expression search of the table.

EXAMPLE 3

Consider a travel related application in which a user is interested in booking a vacation to a particular destination. Assume that a deployed agent now engages the user in a conversation at a web site that books vacation trips. During the course of the conversation, the user types in certain responses to queries by the agent. For example, the agent may ascertain that the user wishes to book a vacation to Maui and is interested in staying on the north side of the island. Responsive to learning this information, the agent causes multimedia showing the north side of Maui to be presented to the user. As the conversation proceeds, the agent learns other information from the user such as various general activities in which the user likes to participate. For example, the agent may learn that the user likes to hike and explore. Responsive to learning this information, the agent may then cause multimedia associated with hiking and exploring on Maui to be presented to the user as the query continues. Needless to say, the systems and methods described above can be utilized to provide a flexibly robust, user-centric experience.

CONCLUSION

The embodiments described above provide a state-based, regular expression parser in which data, such as generally unstructured text, is received into the system and undergoes a tokenization process which permits structure to be imparted to the data. Tokenization of the data effectively enables various patterns in the data to be identified. In some embodiments, one or more components can utilize stimulus/response paradigms to recognize and react to patterns in the data.

Although the invention has been described in language specific to structural features and/or methodological steps, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific features or steps described. Rather, the specific features and steps are disclosed as preferred forms of implementing the claimed invention.

1. A computer-based system for determining a response to an input text string, the system comprising: a server that receives the input text string via a computer data network, wherein the server executes software instructions stored on a computer readable medium, wherein the server is programmed to: tokenize the input text string by parsing the input text string to define one or more recognizable patterns in the input text string; compare the one or more recognizable patterns to a plurality of cases of text to determine whether the one or more recognizable patterns match one or more of the plurality of cases, wherein each of the plurality of cases define a response to be taken in the event of a case match, wherein the cases of text are stored in a knowledge base and defined using a hierarchical tag-based markup language, wherein the hierarchical tag-based markup language comprises; an input tag that identifies a pattern of text to be matched, wherein the plurality of cases of text comprises one or more input tags with exact text strings, one or more input tags with partial text string cases, and one or more input tags with variable text string cases; a response tag that identifies the response in the event of a case match, wherein the response comprises an output text expression to be output by the server; a previous tag associated with the output text expression, where the previous tag constrains a case from matching a recognizable pattern when the server did not previously output an output text expression that matches the output expression associated with the previous tag; and a previous input tag associated with an input expression, where the previous input tag constrains a case from matching a recognizable pattern when the server did not previously receive from a user an input text expression that matches the input expression associated with the previous input tag: when a recognizable pattern matches only one case, perform the response for the case; and when a recognizable pattern matches two or more cases: score the two or more cases to determine the case with the highest probability match based on a scoring function, the scoring function scores exact text string case matches greater than variable text string case matches; and perform the response for the case with the highest probability match.
2. The system of claim 1, further comprising a table stored in a computer readable medium, wherein the table is configured to contain a token assigned to the input text string.
3. The system of claim 1, further comprising: a deployable agent, deployable by the server, comprising a data source configured to provide a pipeline for data to travel; and a runtime object configured to receive and process data that travels through the pipeline, wherein the data is text-based data.
4. The system of claim 3, wherein the deployable agent is an active agent configured to interact with a data origination entity.
5. The system of claim 3, wherein the deployable agent is a passive agent configured to receive information from a data origination entity.
6. A computer-based system for determining a response to an input text string, the system comprising: a server that receives the input text string via a computer data network, wherein the server executes software instructions stored on a computer readable medium, wherein the server comprises: one or more knowledge bases that store a plurality of cases of text, each case defining a response to be taken in the event of a case match, where the cases are defined using a hierarchical tag-based markup language; and a functional presence engine that: tokenizes the input text string by parsing the input text string to define one or more recognizable patterns in the input text string; compares the one or more recognizable patterns to the plurality of cases of text in the one or more knowledge bases to determine whether the one or more recognizable patterns match one or more of the plurality of cases, wherein the hierarchical tag-based markup language comprises; an input tag that identifies a pattern of text to be matched, wherein the plurality of cases of text comprises one or more input tags with exact text strings, one or more input tags with partial text string cases, and one or more input tags with variable text string cases; a response tag that identifies the response in the event of a case match, wherein the response comprises an output text expression to be output by the server; a previous tag associated with an output text expression, where the previous tag constrains a case from matching a recognizable pattern when the server did not previously output an output text expression that matches the output expression associated with the previous tag; and a previous input tag associated with an input expression, where the previous input tag constrains a case from matching a recognizable pattern when the server did not previously receive from a user an input text expression that matches the input expression associated with the previous input tag, when a recognizable pattern matches only one case, performs the response for the case; and when a recognizable pattern matches two or more cases: scores the two or more cases to determine the case with the highest probability match based on a scoring function, wherein the scoring function scores exact text string case matches greater than variable text string case matches; and performs the response for the case with the highest probability match.
7. The system of claim 6, further comprising a table stored in a computer readable medium, wherein the table is configured to contain a token assigned to the input text string by the functional presence engine.
8. The system of claim 6, further comprising: a deployable agent, deployable by the server, comprising a data source configured to provide a pipeline for data to travel; and a runtime object configured to receive and process data that travels through the pipeline, wherein the data is text-based data.
9. The system of claim 8, wherein the deployable agent is an active agent configured to interact with a data origination entity.
10. The system of claim 8, wherein the deployable agent is a passive agent configured to receive information from a data origination entity.
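
By way of illustration only, and without limiting the claims, the case matching and scoring recited in claims 1 and 6 might be sketched as follows. Everything named here is hypothetical and does not appear in the source: the `Case` fields, the `tokenize`, `matches` and `respond` functions, the sample cases, and the numeric scores. The claims require only that exact text string matches score higher than variable text string matches; placing partial matches between the two, and resolving ties in favor of the earliest case, are assumptions of this sketch. Cases are shown as Python objects rather than in the tag-based markup language.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical scores. The claims only require exact > variable;
# ranking "partial" in between is an assumption of this sketch.
SCORES = {"exact": 3, "partial": 2, "variable": 1}


@dataclass
class Case:
    kind: str                             # "exact", "partial", or "variable"
    pattern: str                          # literal text, or a regex for "variable"
    response: str                         # output text expression
    previous: Optional[str] = None        # required prior server output
    previous_input: Optional[str] = None  # required prior user input


def tokenize(text: str) -> str:
    """Reduce raw text to lowercase word tokens joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def matches(case: Case, tokens: str,
            last_output: Optional[str], last_input: Optional[str]) -> bool:
    """Apply the previous/previous-input constraints, then the input pattern."""
    if case.previous is not None and case.previous != last_output:
        return False
    if case.previous_input is not None and tokenize(case.previous_input) != last_input:
        return False
    if case.kind == "exact":
        return tokens == case.pattern
    if case.kind == "partial":
        return case.pattern in tokens
    return re.fullmatch(case.pattern, tokens) is not None


def respond(cases: List[Case], text: str,
            last_output: Optional[str] = None,
            last_input: Optional[str] = None) -> Optional[str]:
    """Return the response of the best-scoring matching case, or None."""
    tokens = tokenize(text)
    prior_in = tokenize(last_input) if last_input is not None else None
    hits = [c for c in cases if matches(c, tokens, last_output, prior_in)]
    if not hits:
        return None
    # max() keeps the earliest case when scores tie (an assumption here).
    return max(hits, key=lambda c: SCORES[c.kind]).response


ASK_NAME = "Hi there. What is your name?"
CASES = [
    Case("exact", "hello", ASK_NAME),
    Case("partial", "hello", "Hello to you too."),
    Case("variable", r"my name is \w+", "Nice to meet you.", previous=ASK_NAME),
    Case("variable", r".*", "I did not understand that."),
]

print(respond(CASES, "Hello!"))
print(respond(CASES, "My name is Ada", last_output=ASK_NAME))
```

In this sketch the input "Hello!" satisfies the exact, partial and catch-all variable cases, and the exact case is selected because it scores highest. The name case is considered only when the server's previous output was the question stored in `ASK_NAME`, which mirrors the constraint imposed by the previous tag.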